ABSTRACT


Procedures of statistical inference are described which
generalize Bayesian inference in specific ways. Probability
is used in such a way that in general only bounds may be
placed on the probabilities of given events, and probability
systems of this kind are suggested both for sample information
and for prior information. These systems are then combined
using a specified rule. Illustrations are given for inferences
about trinomial probabilities, and for inferences about a monotone
sequence of binomial $p_i$. Finally, some comments are made
on the general class of models which produce upper and lower
probabilities, and on the specific models which underlie the
suggested inference procedures.


1.  INTRODUCTION 

Reduced  to  its  mathematical  essentials,  Bayesian  inference 
means  starting  with  a  global  probability  distribution  for  all 
relevant variables, observing the values of some of these variables,
and quoting the conditional distribution of the remaining
variables  given  the  observations.  In  the  generalization  of  this 
paper,  something  less  than  a  global  probability  distribution  is 
required,  while  the  basic  device  of  conditioning  on  observed  data 
is  retained.  Actually,  the  generalization  is  more  specific.  The 
term  Bayesian  commonly  implies  a  global  probability  law  given  in 
two  parts,  first  the  marginal  distribution  of  a  set  of  parameters, 
and  second  a  family  of  conditional  distributions  of  a  set  of 
observable variables given potential sets of parameter values.
The first part, or prior distribution, summarizes a set of beliefs
or state of knowledge in hand before any observations are taken.
The second part, or likelihood function, characterizes the information
carried by the observations. Specific generalizations are
suggested in this paper for both parts of the common Bayesian model,
and  also  for  the  method  of  combining  the  two  parts.  The  components 
of  these  generalizations  are  built  up  gradually  in  Section  2  where 
they  are  illustrated  on  a  model  for  trinomial  sampling. 

Inferences  will  be  expressed  as  probabilities  of  events  defined 
by unknown values, usually unknown parameter values, but sometimes
the values of not yet observed observables. It is not possible here
to go far into the much embroiled questions of whether probabilities
are  or  are  not  objective,  are  or  are  not  degrees  of  belief,  are  or 
are  not  frequencies,  and  so  on.  But  a  few  remarks  may  help  to  set 
the  stage.  I  feel  that  the  proponents  of  different  specific  views 
of probability generally share more attitudes rooted in the common
sense of the subject than they outwardly profess, and that careful
analysis renders many of the basic ideas more complementary than
contradictory. Definitions in terms of frequencies or equally
likely cases do illustrate clearly how reasonably objective probabilities
arise in practice, but they fail in themselves to say
what probabilities mean or to explain the pervasiveness of the
concept of probability in human affairs. Another class of definitions
stresses concepts like degree of confidence or degree of
belief  or  degree  of  knowledge,  sometimes  in  relation  to  betting 
rules  and  sometimes  not.  These  convey  the  flavor  and  motivation  of 
the  science  of  probability,  but  they  tend  to  hide  the  realities 
which  make  it  both  possible  and  important  for  cognizant  people  to 
agree  when  assigning  probabilities  to  uncertain  outcomes.  The 
possibility  of  agreement  arises  basically  from  common  perceptions 
of  symmetries,  such  as  symmetries  among  cases  counted  to  provide 
frequencies, or symmetries which underlie assumptions of exchangeability
or of equally likely cases. The importance of agreement
may be illustrated by the statistician who expresses his
inferences about an unknown parameter value in terms of a set of
betting odds. If this statistician accepts any bet proposed at
his  stated  odds,  and  if  he  wagers  with  colleagues  who  consistently 
have  more  information,  perhaps  in  the  form  of  larger  samples, 
then  he  is  sure  to  suffer  disaster  in  the  long  run.  The  moral 
is  that  probabilities  can  scarcely  be  "fair"  for  business  deals 
unless  both  parties  have  approximately  the  same  probability 
assessments,  presumably  based  on  similar  knowledge  or  information. 
Likewise  probability  inferences  can  contribute  little  to  public 
science  unless  they  are  as  objective  as  the  web  of  generally 
accepted  fact  on  which  they  are  based.  While  knowledge  may 
certainly  be  personal,  the  communication  of  knowledge  is  one  of 
the  most  fundamental  of  human  endeavors.  Statistical  inference 
can  be  viewed  as  the  science  whose  formulations  make  it  possible 
to communicate partial knowledge in the form of probabilities.

Generalized  Bayesian  inference  seeks  to  permit  improvement  on 
classical  Bayesian  inference  through  a  complex  trade-off  of  advantages 
and  disadvantages.  On  the  credit  side,  the  requirement  of  a  global 
probability  law  is  dropped  and  it  becomes  possible  to  work  with  only 
those  probability  assumptions  which  are  based  on  readily  apparent 
symmetry  conditions,  and  are  therefore  reasonably  objective.  For 
example,  in  a  wide  class  of  sampling  models,  including  the  trinomial 
sampling model analyzed in Section 2, no probabilities are assumed
except the familiar and noncontroversial representation of a sample
as n independent and identically distributed random elements from
a population. Beyond this, further assumptions like specific
parametric forms or prior distributions for parameters need be
put in only to the extent that they appear to command a fair
degree of assent.

The  new  inference  procedures  do  not  in  general  yield  exact 
probabilities  for  desired  inferences,  but  only  bounds  for  such 
probabilities.  While  it  may  count  as  a  debit  item  that  inferences 
are  less  precise  than  one  might  have  hoped,  it  is  a  credit  item 
that  greater  flexibility  is  allowed  in  the  representation  of  a  state 
of knowledge. For example, a state of total ignorance about an
uncertain event T is naturally represented by an upper probability
$P^*(T) = 1$ and a lower probability $P_*(T) = 0$. The new flexibility
thus permits a simple resolution of the old controversy about how
to represent total ignorance via a probability distribution. In
real life, ignorance is rarely so total that (0,1) bounds are justified,
but ignorance is likely to be such that a precise numerical probability
is difficult to justify. I believe that experience and familiarity
will show that the general range of bounds $0 \le P_*(T) \le P^*(T) \le 1$
provides a useful tool for representing degrees of knowledge.

Upper and lower probabilities apparently originated with Boole
(1854), and have reappeared after a largely dormant period in Good
(1962) and Smith (1961, 1965). In this paper upper and lower
probabilities are generated by a specific mathematical device whereby
a well-defined probability measure over one sample space becomes
diffused in its application to directly interesting events. In
order to illustrate the idea simply, consider a map showing regions
of land and water. Suppose that .80 of the area of the map is
visible and that the visible area divides in the proportions .30
to .70 of water area to land area. What is the probability that a
point drawn at random from the whole map falls in a region of water?
Since the visible water area is .24 of the total area of the map,
while the unobserved .20 of the total area could be water or land,
it can be asserted only that the desired probability lies between
.24 and .44. The model supposes a well-defined uniform distribution
over the whole map. Of the total measure of unity, the fraction .24
is associated with water, the fraction .56 is associated with land,
and the remaining fraction .20 is ambiguously associated with water
or land. Note the implication of total ignorance of the unobserved
area. There would be no objection to introducing other sources of
information about the unobserved area. Indeed, if such information
were appropriately expressed in terms of an upper and lower probability
model, it could be combined with the above information using a rule of
combination defined within the mathematical system. A correct analogy
can be drawn with prior knowledge of parameter values, which can likewise
be formally incorporated into inferences based on sample data,
using the same rule of combination. The general mathematical system, as
given originally in Dempster (1967a), will be unfolded in Section 2
and will be further commented upon in Section 4.
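
The arithmetic of the map illustration can be reproduced in a few lines.
The following Python sketch is offered only as an informal aid; the names
`masses` and `bounds` are hypothetical and do not come from the paper. It
records the three pieces of the uniform measure (.24 tied to water, .56
tied to land, .20 tied to either) and recovers the interval from .24 to .44.

```python
from fractions import Fraction

# Hypothetical encoding of the map example: each piece of the map's area
# is paired with the set of labels it may carry.
masses = [
    (Fraction(24, 100), {"water"}),          # visible water: 0.80 * 0.30
    (Fraction(56, 100), {"land"}),           # visible land:  0.80 * 0.70
    (Fraction(20, 100), {"water", "land"}),  # unseen area: could be either
]

def bounds(event, masses):
    """Return (lower, upper) probability of `event`, a set of labels."""
    # Lower bound: mass that can only fall inside the event.
    lower = sum(m for m, labels in masses if labels <= event)
    # Upper bound: mass that could possibly fall inside the event.
    upper = sum(m for m, labels in masses if labels & event)
    return lower, upper

lower, upper = bounds({"water"}, masses)
print(float(lower), float(upper))  # 0.24 0.44
```

The same function applied to the event "land" gives the complementary
interval from .56 to .76.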

If the inference procedures suggested in this paper are somewhat
speculative in nature, the reason lies, I believe, not in a
lack  of  objectivity  in  the  probability  assumptions,  nor  in  the  upper 
and  lower  probability  feature.  Rather,  the  source  of  the  speculative 
quality  is  to  be  found  in  the  logical  relationships  between  population 
members  and  their  observable  characteristics  which  are  postulated  in 
each  model  set  up  to  represent  sampling  from  a  population.  These 
logical  relationships  are  conceptual  devices,  which  are  not  regarded 
as  empirically  checkable  even  in  principle,  and  they  are  somewhat 
arbitrary.  Their  acceptability  will  be  analyzed  in  Section  5  where 
it will be argued that the arbitrariness may correspond to something
real  in  the  nature  of  an  uncertainty  principle. 

A  degree  of  arbitrariness  does  not  in  itself  rule  out  a  method 
of  statistical  inference.  For  example,  confidence  statements  are 
widely used in practice despite the fact that many confidence procedures
are often available within the same model and for the same
question, and there is no well-established theory for automatic choice
among available confidence procedures. In part, therefore, the usefulness
of generalized Bayesian inference procedures will require that
practitioners experiment with them and come to feel comfortable with
them. Relatively few procedures are as yet analytically tractable,
but two examples are included, namely, the trinomial sampling
inference procedures of Section 2, and a procedure for distinguishing
between monotone upward and monotone downward sequences
of binomial $p_i$ as given in Section 3. Another model is worked
through  in  detail  in  Dempster  (1967b). 

Finally,  an  acknowledgment  is  due  to  R.A.  Fisher  who  announced 
with  characteristic  intellectual  boldness,  nearly  four  decades  ago, 
that  probability  inferences  were  indeed  possible  outside  of  the 
Bayesian formulation. Fisher compiled a list of examples and guidelines
which seemed to him to lead to acceptable inferences in terms
of probabilities which he called fiducial probabilities. The mathematical
formulation of this paper is broad enough to include the
fiducial  argument  in  addition  to  standard  Bayesian  methods.  But  the 
specific  models  which  Fisher  advocated,  depending  on  ingenious  but 
often  controversial  pivotal  quantities,  are  replaced  here  by  models 
which  start  further  back  at  the  concept  of  a  population  explicitly 
represented  by  a  mathematical  space.  Fisher  did  not  consider  models 
which lead to separated upper and lower probabilities, and indeed
went  to  some  lengths,  using  sufficiency  and  ancillarity,  and 
arranging  that  the  spaces  of  pivotal  quantities  and  of  parameters  be 
of  the  same  dimension,  in  order  to  ensure  that  ambiguity  did  not 
appear. This paper is largely an exploration of fiducial-like arguments
in a more relaxed mathematical framework. But, since Bayesian
methods are more in the main stream of development, and since I
do explicitly provide for the incorporation of prior information,
I now prefer to describe my methods as extensions of Bayesian
methods rather than alternative fiducial methods. I believe that
Fisher too regarded fiducial inference as being very close to
Bayesian inference in spirit, differing primarily in that
fiducial inference did not make use of prior information.


2.  UPPER AND LOWER PROBABILITY INFERENCES ILLUSTRATED
ON A MODEL FOR TRINOMIAL SAMPLING

A pair of sample spaces X and S underlie the general form
of mathematical model appearing throughout this work. The first
space X carries an ordinary probability measure $\mu$, but interest
centers on events which are identified with subsets of S. A
bridge is provided from X to S by a logical relationship which
asserts that, if x is the realized sample point in X, then the
realized sample point s in S must belong to a subset $\Gamma x$ of S.
Thus a basic component of the model is a mathematical transformation
which associates a subset $\Gamma x$ of S with each point x of X.
Since the $\Gamma x$ determined by a specific x contains in general many
points (or branches or values), the transformation $x \to \Gamma x$ may be
called a multivalued mapping. Apart from measurability considerations,
which are ignored in this paper, the general model is defined
by the elements introduced above and will be labeled $(X, S, \mu, \Gamma)$ for
convenient reference. Given $(X, S, \mu, \Gamma)$, upper and lower probabilities
$P^*(T)$ and $P_*(T)$ are determined for each subset T of S.

In the map example of Section 1, X is defined by the points
of the map, S is defined by two points labeled "water" and "land",
$\mu$ is the uniform distribution of probability over the map, and $\Gamma$ is
the mapping which associates the single point "water" or "land"
in S with the appropriate points of the visible part of X, and
associates both points of S with the points of the unseen part
of X. For set-theoretic consistency, $\Gamma x$ should be regarded as
a single point subset of S, rather than a single point itself,
over the visible part of X, but the meaning is the same either
way.

The general definitions of $P^*(T)$ and $P_*(T)$ as given in
Dempster (1967a) are repeated below in more verbal form. For any
subset T of S, define $T^*$ to be the set of points x in X
for which $\Gamma x$ has a nonempty intersection with T, and define
$T_*$ to be the set of points x in X for which $\Gamma x$ is contained
in T but is not empty. In particular, the sets $S^*$ and $S_*$
coincide and they constitute the union of all $\Gamma x$ as x ranges
over X. The complement $X - S^*$ of $S^*$ consists of those x
for which $\Gamma x$ is the empty set. Now define the upper probability
of T to be

(2.1)    $P^*(T) = \mu(T^*) / \mu(S^*)$

and the lower probability of T to be

(2.2)    $P_*(T) = \mu(T_*) / \mu(S^*)$.

Note that, since $T_* \subset T^* \subset S^*$, one has

(2.3)    $0 \le P_*(T) \le P^*(T) \le 1$.

Also, if $\bar{T}$ is the complement of T in S, then $\bar{T}_*$ and $\bar{T}^*$
are respectively the complements of $T^*$ and $T_*$ in $S^*$, so that

(2.4)    $P^*(T) = 1 - P_*(\bar{T})$ and $P_*(T) = 1 - P^*(\bar{T})$.

Other formal consequences of the above definitions are explored
in Dempster (1967a).
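
For a finite space X the definitions (2.1) and (2.2) reduce to sums, and
the relations (2.3) and (2.4) can be checked directly. The Python sketch
below is an illustration only: the function name `upper_lower`, the
four-point space and the particular multivalued mapping are all assumed
for the example and are not taken from Dempster (1967a).

```python
from fractions import Fraction

def upper_lower(mu, gamma, T):
    """Upper and lower probabilities of T as in (2.1) and (2.2).

    mu    : dict mapping each point x of X to its probability mass
    gamma : dict mapping each x to the subset Gamma x of S (a frozenset)
    T     : the subset of S of interest
    """
    T = frozenset(T)
    s_star = [x for x in mu if gamma[x]]             # Gamma x nonempty
    t_upper = [x for x in s_star if gamma[x] & T]    # Gamma x meets T
    t_lower = [x for x in s_star if gamma[x] <= T]   # Gamma x inside T
    norm = sum(mu[x] for x in s_star)                # divisor mu(S*)
    upper = sum(mu[x] for x in t_upper) / norm
    lower = sum(mu[x] for x in t_lower) / norm
    return upper, lower

# Hypothetical four-point X and three-point S; x4 maps to the empty set.
S = frozenset({"a", "b", "c"})
mu = {"x1": Fraction(1, 4), "x2": Fraction(1, 4),
      "x3": Fraction(1, 4), "x4": Fraction(1, 4)}
gamma = {"x1": frozenset({"a"}), "x2": frozenset({"a", "b"}),
         "x3": frozenset({"c"}), "x4": frozenset()}

T = {"a"}
up, lo = upper_lower(mu, gamma, T)
up_c, lo_c = upper_lower(mu, gamma, S - T)   # complement of T in S
print(up, lo)  # 2/3 1/3
```

In this example the mass on x4 is discarded by the normalization, and the
bounds for the complement of T are again 2/3 and 1/3, in agreement with
(2.4).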

The heuristic conception which motivates (2.1) and (2.2) is
the idea of carrying probability elements $d\mu$ from X to S
along the branches of the mapping $\Gamma x$. The ambiguity in the
consequent probability measure over S occurs because the probability
element $d\mu(x)$ associated with x in X may be carried along any
branch of $\Gamma x$ or, more generally, may be distributed over the different
branches of $\Gamma x$ for each x. Part of the $\mu$ measure,
namely the measure of the set $X - S^*$ consisting of points x such
that $\Gamma x$ is empty, cannot be moved from X at all. Since there is
an implicit assumption that some s in S is actually realized, it
is appropriate to condition by $S^*$ when defining relevant probabilities.
This explains the divisor $\mu(S^*)$ appearing in (2.1) and (2.2).
Among all the ways of transferring the relevant probability $\mu(S^*)$
from X to S along branches of $\Gamma x$, the largest fraction which
can possibly follow branches into T is $P^*(T)$, while the smallest
possible fraction is $P_*(T)$. Thus conservative probability judgments
may be rendered by asserting only that the probability of T
lies between the indicated upper and lower bounds.

It may also be illuminating to view $\Gamma x$ as a random set in S
generated by the random point x in X, subject to the condition
that $\Gamma x$ is not empty. After conditioning on $S^*$, $P^*(T)$ is the
probability that the random set $\Gamma x$ intersects the fixed set T,
while $P_*(T)$ is the probability that the random set $\Gamma x$ is contained
in the fixed set T.

A probability model like $(X, S, \mu, \Gamma)$ may be modified into
other probability models of the same general type by conditioning
on subsets of S. Such conditioning on observed data defines the
generalized Bayesian inferences of this paper. Beyond and generalizing
the concept of conditioning, there is a natural rule for
combining or multiplying several independent models of the type
$(X, S, \mu, \Gamma)$ to obtain a single product model of the same type.

For  example,  the  models  for  n  independent  sample  observations 
may  be  put  together  by  the  product  rule  to  yield  a  single  model 
for  a  sample  of  size  n,  and  the  model  defining  prior  information 
may  be  combined  with  the  model  carrying  sample  information  by  the 
same  rule.  The  rules  for  conditioning  and  multiplying  will  be 
transcribed  below  from  Dempster  (1967a)  and  will  be  illustrated 
on  a  model  for  trinomial  sampling.  First,  however,  the  elements  of 
the  trinomial  sampling  model  will  be  introduced  for  a  sample  of 
size  one. 

Each member of a large population, shortly to be idealized as
an infinite population, is supposed known to belong to one of three
identifiable categories $c_1$, $c_2$ and $c_3$, where the integer subscripts
do not indicate a natural ordering of the categories. Thus the
individuals of the population could be balls in an urn, identical
in appearance apart from their colours which are red ($c_1$) or white
($c_2$) or blue ($c_3$). A model will be defined which will ultimately
lead to procedures for drawing inferences about unknown population
proportions of $c_1$, $c_2$ and $c_3$ given the categories of a random
sample of size n from the population. Following Dempster (1966),
the individuals of the population will be explicitly represented by
the points of a space U, and the randomness associated with a
sample individual drawn from U will be characterized by a probability
measure over U. Thus, a finite population of size N could
be represented by any finite space U with N elements, with random
sampling represented by the uniform distribution of probability over
the N elements of U. Such a finite population model is analyzed in
detail in Dempster (1967b). Here, however, the population is treated
as infinite, and, for reasons tied up with the trinomial observable,
the space U is identified with a triangle. Convenient barycentric
coordinates for a general point of U are

(2.5)    $u = (u_1, u_2, u_3)$,

where $0 \le u_1$, $0 \le u_2$, $0 \le u_3$ and $u_1 + u_2 + u_3 = 1$. See Figure 1.
It is further supposed that a random sample of size one means an
individual u drawn according to the uniform distribution p over
the triangle U. In the model $(X, S, \mu, \Gamma)$ representing a random
sample of size one from a trinomial population, the roles of X and
$\mu$ will be played by U and p.

Two further spaces enter naturally into the model for a single
trinomial observation. The first is the three element space
$C = \{c_1, c_2, c_3\}$ whose general member c represents the observable
category of the sample individual. The second is the space $\Pi$ whose
general point is

(2.6)    $\pi = (\pi_1, \pi_2, \pi_3)$,

with $0 \le \pi_1$, $0 \le \pi_2$, $0 \le \pi_3$ and $\pi_1 + \pi_2 + \pi_3 = 1$, where $\pi_i$ is
to be interpreted for i = 1, 2, 3 as the proportion of the population
falling in category $c_i$. Note that $\Pi$ is a mathematical
copy of U, but its applied meaning is distinct from that of U.
The role of S in the general model $(X, S, \mu, \Gamma)$ will be played
by the product space $C \times \Pi$ which represents jointly the observation
on a single random individual together with the population
proportions of $c_1$, $c_2$ and $c_3$. Finally, the role of $\Gamma$ is played
by B where, for any u in U, the set Bu in $C \times \Pi$ consists
of the points $(c_i, \pi)$ such that

(2.7)    $\pi_i / u_i = \max(\pi_1/u_1, \pi_2/u_2, \pi_3/u_3)$

for i = 1, 2, 3. To understand the definition of B, but not yet
the motivation for the definition, it is helpful to visualize
$C \times \Pi$ as a stack of three triangles as in Figure 2 where the
three levels correspond to the three points of C. The contributions
to Bu from the three levels of $C \times \Pi$ are shown as shaded
areas in Figure 2. It is important also to understand the inverse
mapping $B^{-1}$ which carries points of $C \times \Pi$ to subsets of U,
where

(2.8)    $U_i = B^{-1}(c_i, \pi)$

is defined to be the subset of U consisting of points u for
which Bu contains $(c_i, \pi)$. The subsets $U_1$, $U_2$, $U_3$ defined by a
given $\pi$ in $\Pi$ are illustrated in Figure 1.

Figure 1. A triangle representing the space U, showing the
barycentric coordinates of the three vertices of U together
with a general point $u = (u_1, u_2, u_3)$. The three closed sub-triangles
labelled $U_1$, $U_2$ and $U_3$ with a common vertex at $\pi$
represent the subsets of U consisting of points u such that
Bu contains $(c_1, \pi)$, $(c_2, \pi)$ and $(c_3, \pi)$, respectively.

Figure 2. The space $C \times \Pi$ represented as triangles on three
levels. The three closed shaded regions together make up the
subset Bu determined from a given u.

It is easily checked with the help of Figure 1 that

(2.9)    $p(U_i) = \pi_i$ and $p(U_i \cap U_j) = 0$

for i, j = 1, 2, 3 and $i \ne j$. It will be shown later that the
property (2.9) is a basic requirement for the mapping B defined
in (2.7). Other choices of U and B could be made which would
also satisfy (2.9). Some of these choices amount to little more
than adopting different coordinate systems for U, but other
possible choices differ in a more fundamental way. Thus an element
of arbitrariness enters the model for trinomial sampling at the point
of choosing U and B. The present model was introduced in
Dempster (1966) under the name structure of the second kind. Other
possibilities will be mentioned in Section 5.
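
Property (2.9) can also be checked by simulation. The Python sketch below
is an informal check, not part of the original argument. It assumes that
(2.7) assigns u to the category whose ratio of population proportion to
coordinate is largest, draws u uniformly over the triangle, and compares
the observed frequencies of the three categories with the proportions.
The helper names, the sample size and the seed are arbitrary choices made
for the illustration.

```python
import random

def sample_u(rng):
    """Draw u uniformly from the triangle u1 + u2 + u3 = 1, u_i >= 0."""
    # The gaps left by two sorted uniform draws are uniform on the triangle.
    a, b = sorted((rng.random(), rng.random()))
    return (a, b - a, 1.0 - b)

def category(u, pi):
    """Index i maximizing pi_i / u_i, as in (2.7); ties have probability zero."""
    ratios = [p / x if x > 0 else float("inf") for p, x in zip(pi, u)]
    return max(range(3), key=lambda i: ratios[i])

def estimate_measures(pi, n=200_000, seed=0):
    """Monte Carlo estimate of the measures of U_1, U_2, U_3 for a given pi."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(n):
        counts[category(sample_u(rng), pi)] += 1
    return [c / n for c in counts]

pi = (0.5, 0.3, 0.2)
print(estimate_measures(pi))  # each entry should be close to the matching pi_i
```

Agreement to about two decimal places is all that should be expected from
a simulation of this size.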

All of the pieces of the model $(U, C \times \Pi, p, B)$ are now in
place, so that upper and lower probabilities may be computed for
subsets T of $C \times \Pi$. It turns out, however, that $P^*(T) = 1$ and
$P_*(T) = 0$ for interesting choices of T, and that interesting
illustrations of upper and lower probabilities are apparent only
after conditioning. For example, take T to be the event that
category $c_1$ will be observed in a single drawing from the population,
i.e., $T = C_1 \times \Pi$ where $C_1$ is the subset of C consisting
of $c_1$ only. To check that $P^*(T) = 1$ and $P_*(T) = 0$, note
(i) that $T^* = U$ because every u in U lies in $U_1$ of Figure 1
for some $(c_1, \pi)$ in $C_1 \times \Pi$, and (ii) that $T_*$ is empty because
no u in U lies in $U_1$ for all $(c_1, \pi)$ in $C_1 \times \Pi$. In general,
any nontrivial event governed by C alone or by $\Pi$ alone will have
upper probability unity and lower probability zero. Such a result
is sensible, for if no information about $\pi$ is put into the system
no information about a sample observation should be available, while
if no sample observation is in hand there should be no available
information about $\pi$. (Recall the interpretation suggested in
Section 1 that $P^*(T) = 1$ and $P_*(T) = 0$ should convey a state of
complete ignorance about whether or not the real world outcome
s will prove to lie in T.)

Turning now to the concept of upper and lower conditional
probabilities, the definition which fits naturally with the general
model $(X, S, \mu, \Gamma)$ arises as follows. If information is received
to the effect that sample points in S - T are ruled out of consideration,
then the logical assertion "x in X must correspond to
s in $\Gamma x \subset S$" is effectively altered to read "x in X must
correspond to s in $\Gamma x \cap T$ in S." Thus the original model
$(X, S, \mu, \Gamma)$ is conditioned on T by altering $(X, S, \mu, \Gamma)$ to
$(X, S, \mu, \tilde{\Gamma})$, where the multivalued mapping $\tilde{\Gamma}$ is defined by

(2.10)    $\tilde{\Gamma} x = \Gamma x \cap T$.

Under the conditioned model, an outcome in S - T is regarded as
impossible, and indeed the set S - T has upper and lower conditional
probabilities both zero. It is sufficient for practical purposes,
therefore, to take the conditional model to be $(X, T, \mu, \tilde{\Gamma})$ and
to consider upper and lower conditional probabilities only for
subsets of T.
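
The conditioning rule (2.10) is easy to exercise on a small finite model.
The Python sketch below is illustrative only: the three-point space, the
mapping and the names `condition` and `upper_lower` are assumptions made
for the example. It intersects each image set with the conditioning set
and then applies the normalized definitions (2.1) and (2.2), so that
points whose image becomes empty drop out of the divisor.

```python
from fractions import Fraction

def upper_lower(mu, gamma, T):
    """Upper and lower probabilities of T, normalized as in (2.1) and (2.2)."""
    T = frozenset(T)
    live = [x for x in mu if gamma[x]]            # points with Gamma x nonempty
    norm = sum(mu[x] for x in live)
    upper = sum(mu[x] for x in live if gamma[x] & T) / norm
    lower = sum(mu[x] for x in live if gamma[x] <= T) / norm
    return upper, lower

def condition(gamma, T):
    """Conditioned mapping of (2.10): intersect every Gamma x with T."""
    T = frozenset(T)
    return {x: g & T for x, g in gamma.items()}

# Hypothetical model: three equally weighted points of X, S = {a, b, c}.
mu = {"x1": Fraction(1, 3), "x2": Fraction(1, 3), "x3": Fraction(1, 3)}
gamma = {"x1": frozenset({"a", "c"}),
         "x2": frozenset({"b"}),
         "x3": frozenset({"c"})}

gamma_T = condition(gamma, {"a", "b"})            # rule out outcome "c"
print(upper_lower(mu, gamma, {"a"}))              # before conditioning
print(upper_lower(mu, gamma_T, {"a"}))            # after conditioning
print(upper_lower(mu, gamma_T, {"c"}))            # excluded outcome
```

Before conditioning the bounds for "a" are 0 and 1/3. After conditioning,
x3 has an empty image and is removed by the normalization, the bounds for
"a" coincide at 1/2, and the excluded outcome "c" has upper and lower
conditional probabilities both zero.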

Although samples of size one are of little practical interest,
the model for a single trinomial observation provides two good
illustrations of the definition of a conditioned model. First, it
will be shown that conditioning on a fixed value of $\pi = (\pi_1, \pi_2, \pi_3)$
results in $\pi_i$ being both the upper and lower conditional probability
of an observation $c_i$, for i = 1, 2, 3. This result is
equivalent to (2.9) and explains the importance of (2.9), since any
reasonable model should require that the population proportions be
the same as the probabilities of the different possible outcomes in
a single random drawing when the population proportions are known.
Second, it will be shown that nontrivial inferences about $\pi$ may be
obtained by conditioning on the observed category c of a single
individual randomly drawn from U.

In precise mathematical terms, to condition the trinomial
sampling model $(U, C \times \Pi, p, B)$ on a fixed $\pi$ is to condition
on $\tilde{T} = C \times \tilde{\Pi}$, where $\tilde{\Pi}$ is the subset of $\Pi$ consisting of the
single point $\pi$. $\tilde{T}$ itself consists of the three points $(c_1, \pi)$,
$(c_2, \pi)$ and $(c_3, \pi)$ which in turn define single point subsets $T_1$, $T_2$
and $T_3$ of $\tilde{T}$. The conditioned model may be written $(U, \tilde{T}, p, \tilde{B})$
where $\tilde{B}u = Bu \cap \tilde{T}$ for all u. By referring back to the definition
of B as illustrated in Figures 1 and 2, it is easily checked that
the set of u in U such that $\tilde{B}u$ intersects $T_i$ is the closed
triangle $U_i$ appearing in Figure 1, while the set of u in U
such that $\tilde{B}u$ is contained in $T_i$ is the open triangle $U_i$, for
i = 1, 2, 3. Whether open or closed, the triangle $U_i$ has measure
$\pi_i$, and it follows easily from (2.9) that the upper and lower
conditional probabilities of $T_i$ given $\tilde{T}$ are

(2.11)    $P^*(T_i \mid \tilde{T}) = P_*(T_i \mid \tilde{T}) = \pi_i$,

for i = 1, 2, 3. Note that $\tilde{B}u$ is not empty for any u in U,
so that the denominators in (2.1) and (2.2) are both unity in the
application (2.11).

Consider next the details of conditioning the trinomial model
on a fixed observation $c_1$. The cases where a single drawing produces
$c_2$ or $c_3$ may be handled by permuting indices. Observing
$c_1$ is formally represented by conditioning on $\tilde{T} = C_1 \times \Pi$ where
$C_1$ as above is the subset of C consisting of $c_1$ alone. In the
conditional model $(U, \tilde{T}, p, \tilde{B})$, the space $\tilde{T}$ is represented by the
first level in Figure 2 while $\tilde{B}u$ is represented by the closed
shaded region in that first level. Since $\tilde{B}u$ is nonempty for all
u in U, the p measure may be used directly without renormalization
to compute upper and lower conditional probabilities given $\tilde{T}$. An
event R defined as a subset of $\Pi$ is equivalently represented by
the subset $C_1 \times R$ of $\tilde{T}$. The upper conditional probability of
$C_1 \times R$ given $\tilde{T}$ is the probability that the random region $\tilde{B}u$
intersects $C_1 \times R$ where $(c_1, u)$ is uniformly distributed over
$C_1 \times \Pi$. See Figure 3. Similarly the lower conditional probability
of $C_1 \times R$ given $\tilde{T}$ is the probability that the random region $\tilde{B}u$
is contained in $C_1 \times R$. For example, if R is the lower portion
of the triangle where $0 \le \pi_1 \le \pi_1''$, then
$P^*(C_1 \times R \mid \tilde{T}) = 1 - (1 - \pi_1'')^2 = \pi_1''(2 - \pi_1'')$
and $P_*(C_1 \times R \mid \tilde{T}) = 0$. Or, in more colloquial notation,
$P^*(0 \le \pi_1 \le \pi_1'' \mid c = c_1) = \pi_1''(2 - \pi_1'')$ and
$P_*(0 \le \pi_1 \le \pi_1'' \mid c = c_1) = 0$.

Figure 3. The triangle $\tilde{T} = C_1 \times \Pi$ for the model conditioned
on the observation $c_1$. Horizontal shading covers the region
$\tilde{B}u$, while vertical shading covers a general fixed region
$C_1 \times R$.

More  generally,  it  can  easily  be  checked  that 

(2.12)  P*(7T1‘  £  7TX  <  TT^'lc  =  CL)  -  7^"  (2  -  7 T^'), 
while 

(2.13)  ?-k(7ri  <  Ti  <  7r1" I c  "  Cj,)  ■  0  if  7T1n  <  1 

■  (1  -  tt-')^  if  tt-"  * 

for  any  fixed  tt^1  and  tt^"  satisfying  0  <  ir^'  <  7r^n  <  1.  Likewise, 

(2.14)  P*(7T2'  £  tt2  <  ir2"|c  «  cx)  -  1  -  7T2 *  , 

while 

(2.15)  P*(7r2 1  <  7r2  <  7r2"|c  »  c1)  «  0  if  tt2'  >  0 

»  7r2"  if  7r2  ‘  »  0,  4 

for  any  fixed  7r2'  and  7r2"  satisfying  0  <  7r2 '  <  7^"  <  1.  Relations 
(2.14)  and  (2.15)  also  hold  when  subscripts  2  and  3  are  interchanged. 
Formulas  (2.12)  through  (2.15)  are  the  first  instances  of  generalized 
Bayesian  inferences  reached  in  this  paper,  v?here,  as  will  shortly  be 
explained,  prior  knowledge  of  w  is  tacitly  assumed  to  have  the  null 
form  such  that  all  upper  probabilities  are  unity  and  all  lower 
probabilities  are  zero.  For  example,  the  model  asserts  that,  if 
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a  single  random  individual  is  observed  to  belong  in  category  c^, 
and  no  prior  knowledge  of  r  is  assumed,  it  may  be  inferred  that 
at  least  half  the  population  belongs  in  with  probability  between 
1/4  and  1. 
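
The closed forms (2.12) through (2.15) can be checked numerically.  The
following is a minimal sketch, not part of the original development: the
function names are illustrative, and the simulation assumes that, after
observing $c_1$, the random region covers exactly the values of $\pi_1$
between $u_1$ and 1, where $u_1$ is the first coordinate of a point drawn
uniformly from the triangle.

```python
import random

def bounds_pi1(lo, hi):
    """Upper and lower probabilities of lo <= pi_1 <= hi, per (2.12)-(2.13)."""
    upper = hi * (2.0 - hi)
    lower = (1.0 - lo) ** 2 if hi >= 1.0 else 0.0
    return upper, lower

def bounds_pi2(lo, hi):
    """Upper and lower probabilities of lo <= pi_2 <= hi, per (2.14)-(2.15)."""
    upper = 1.0 - lo
    lower = hi if lo <= 0.0 else 0.0
    return upper, lower

def simulate_pi1(lo, hi, draws=200_000, seed=0):
    """Monte Carlo estimate of the same two bounds for pi_1."""
    rng = random.Random(seed)
    hits = inside = 0
    for _ in range(draws):
        # The smaller of two uniforms is distributed like the first
        # coordinate of a point drawn uniformly from the triangle.
        u1 = min(rng.random(), rng.random())
        # Assumption: the region spans the pi_1 values in [u1, 1].
        if u1 <= hi:                  # region meets [lo, hi]
            hits += 1
        if u1 >= lo and hi >= 1.0:    # region lies inside [lo, hi]
            inside += 1
    return hits / draws, inside / draws

# "At least half the population belongs in c_1": lo = 1/2, hi = 1.
print(bounds_pi1(0.5, 1.0))
print(simulate_pi1(0.5, 1.0))
```

The first call reproduces the interval from 1/4 to 1 quoted above, with the
bounds returned in the order (upper, lower).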

A collection of $n$ models $(X^{(i)}, S, \mu^{(i)}, \Gamma^{(i)})$ for $i = 1, 2, \ldots, n$
may be combined or multiplied to obtain a product model $(X, S, \mu, \Gamma)$.
The formal definition of $(X, S, \mu, \Gamma)$ is given by

(2.16)  $X = X^{(1)} \times X^{(2)} \times \cdots \times X^{(n)}$,
        $\mu = \mu^{(1)} \times \mu^{(2)} \times \cdots \times \mu^{(n)}$, and
        $\Gamma x = \Gamma^{(1)}x^{(1)} \cap \Gamma^{(2)}x^{(2)} \cap \cdots \cap \Gamma^{(n)}x^{(n)}$,

where $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})$ denotes a general point of the
product space $X$.  The product model is appropriate where the
realized values $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ are regarded as independently
random according to the probability measures $\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(n)}$,
while the logical relationships implied by $\Gamma^{(1)}, \Gamma^{(2)}, \ldots, \Gamma^{(n)}$ are
postulated to apply simultaneously to a common realized outcome $s$
in $S$.  It may be helpful to view the models $(X^{(i)}, S, \mu^{(i)}, \Gamma^{(i)})$
as separate sources of information about the unknown $s$ in $S$.  In
such a view, if the $n$ sources are genuinely independent, then the
product rule (2.16) represents the legitimate way to pool their
information.
The concept of a product model actually includes the concept
of a conditioned model which was introduced earlier.  Proceeding
formally, the information that $T$ occurs with certainty may be
represented by a degenerate model $(Y, S, \nu, \Lambda)$ where $Y$ consists
of a single point $y$, while $\Lambda y = T$ and $y$ carries $\nu$ measure
unity.  Multiplying a general model $(X, S, \mu, \Gamma)$ by $(Y, S, \nu, \Lambda)$
produces essentially the same result as conditioning the general
model $(X, S, \mu, \Gamma)$ on $T$.  For $X \times Y$ and $\mu \times \nu$ are isomorphic
in an obvious way to $X$ and $\mu$, while $\Gamma x \cap \Lambda y = \Gamma x \cap T = \tilde{\Gamma}x$ as in
(2.10).  Thus the objective of taking account of information in the
special form of an assertion that $T$ must occur may be reached
either through the rule of conditioning or through the rule of
multiplication, with identical results.  In particular, when $T = S$
the degenerate model $(Y, S, \nu, \Lambda)$ conveys no information about the
uncertain outcome $s$ in $S$, both in the heuristic sense that upper
and lower probabilities of nontrivial events are unity and zero,
and in the formal sense that combining such a $(Y, S, \nu, \Lambda)$ with
any information source $(X, S, \mu, \Gamma)$ leaves the latter model
essentially unaltered.
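
For models with finitely many points, the product rule (2.16) and its
degenerate special case can be sketched in a few lines.  This is an
illustrative sketch under stated assumptions: each model is stored as a
list of (mass, subset of $S$) pairs, one pair per point of $X$, and the
names `multiply` and `upper_lower` are hypothetical helpers introduced
here, not notation from the paper.

```python
from itertools import product as cartesian

def multiply(*models):
    """Product rule (2.16): multiply the masses, intersect the subsets."""
    combined = {}
    for combo in cartesian(*models):
        weight = 1.0
        region = None
        for mass, subset in combo:
            weight *= mass
            region = subset if region is None else region & subset
        # Points of the product space that map to the same subset are pooled.
        combined[region] = combined.get(region, 0.0) + weight
    return [(w, s) for s, w in combined.items()]

def upper_lower(model, event):
    """Upper and lower probabilities, renormalized over nonempty subsets."""
    nonempty = sum(w for w, s in model if s)
    upper = sum(w for w, s in model if s & event) / nonempty
    lower = sum(w for w, s in model if s and s <= event) / nonempty
    return upper, lower

S = frozenset("abc")
source_1 = [(0.5, frozenset("a")), (0.5, S)]
source_2 = [(0.5, frozenset("b")), (0.5, S)]

pooled = multiply(source_1, source_2)
print(upper_lower(pooled, frozenset("a")))

# Conditioning on T = {b, c} is multiplication by a one-point model.
conditioned = multiply(source_1, [(1.0, frozenset("bc"))])
print(upper_lower(conditioned, frozenset("b")))
```

Multiplying by the one-point model whose subset is all of $S$ leaves the
upper and lower probabilities of every event unchanged, which is the formal
sense in which that model carries no information.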

Product models are widely used in mathematical statistics to
represent random samples of size $n$ from infinite populations, and
they apply directly to provide the general sample size extension of
the trinomial sampling model $(U, C \times \Pi, \mu, B)$.  A random sample of
size $n$ from the population $U$ is represented by $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$
independently drawn from $U$ according to the same uniform probability
measure $\mu$.  More precisely, the sample $(u^{(1)}, u^{(2)}, \ldots, u^{(n)})$
is represented by a single random point drawn from the product space

(2.17)  $U^n = U^{(1)} \times U^{(2)} \times \cdots \times U^{(n)}$

according to the product measure

(2.18)  $\mu^n = \mu^{(1)} \times \mu^{(2)} \times \cdots \times \mu^{(n)}$,

where the pairs $(U^{(1)}, \mu^{(1)}), (U^{(2)}, \mu^{(2)}), \ldots, (U^{(n)}, \mu^{(n)})$ are $n$
identical mathematical copies of the original pair $(U, \mu)$.  In a
similar way, the observable categories of the $n$ sample individuals
are represented by a point in the product space

(2.19)  $C^n = C^{(1)} \times C^{(2)} \times \cdots \times C^{(n)}$,

where $C^{(i)}$ is the three element space from which the observable
category $c^{(i)}$ of the sample individual $u^{(i)}$ is taken.  The
interesting unknowns before sampling are $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$ and
$\boldsymbol{\pi}$, which define a point in the space $C^n \times \Pi$.  Accordingly, the
model which represents a random sample of size $n$ from a trinomial
population is of the form $(U^n, C^n \times \Pi, \mu^n, B^n)$, where it remains
only to define $B^n$.  In words, $B^n$ is the logical relationship
which requires that (2.7) shall hold for each $u^{(i)}$.  In symbols,
(2.20)  $B^n(u^{(1)}, u^{(2)}, \ldots, u^{(n)}) = B^{(1)}u^{(1)} \cap B^{(2)}u^{(2)} \cap \cdots \cap B^{(n)}u^{(n)}$,

where $B^{(i)}u^{(i)}$ consists of those points $(c^{(1)}, c^{(2)}, \ldots, c^{(n)}, \boldsymbol{\pi})$
in $C^n \times \Pi$ such that

(2.21)  $\dfrac{\pi_k}{u^{(i)}_k} = \max\left( \dfrac{\pi_1}{u^{(i)}_1}, \dfrac{\pi_2}{u^{(i)}_2}, \dfrac{\pi_3}{u^{(i)}_3} \right)$  when $c^{(i)} = c_k$,  for $k = 1, 2, 3$.


The model $(U^n, C^n \times \Pi, \mu^n, B^n)$ now completely defined
provides in itself an illustration of the product rule.  For (2.17),
(2.18) and (2.20) are instances of the three lines of (2.16), and
hence show that $(U^n, C^n \times \Pi, \mu^n, B^n)$ is the product of the $n$
models $(U^{(i)}, C^n \times \Pi, \mu^{(i)}, B^{(i)})$ for $i = 1, 2, \ldots, n$, each
representing an individual sample member.
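
The intersection in (2.20) can be made concrete with a small membership
test.  The sketch below is illustrative only and rests on an assumption:
that the single-individual relationship reads "$\pi_k / u_k$ is the largest
of the three ratios $\pi_j / u_j$ when the individual $u$ is observed in
category $c_k$".  Categories are numbered 0, 1, 2 in the code, and all
function names are hypothetical.

```python
import random

def in_region(pi, u, k):
    """True if pi is compatible with individual u observed in category k."""
    # pi_k / u_k >= pi_j / u_j for every j, cross-multiplied to avoid division.
    return all(pi[k] * u[j] >= pi[j] * u[k] for j in range(3))

def in_intersection(pi, us, cats):
    """True if pi lies in the intersection of the n single-individual regions."""
    return all(in_region(pi, u, k) for u, k in zip(us, cats))

def inclusion_frequency(pi, k, draws=100_000, seed=0):
    """Fraction of uniformly drawn u whose category-k region covers pi."""
    rng = random.Random(seed)
    count = 0
    for _ in range(draws):
        a, b = sorted((rng.random(), rng.random()))
        u = (a, b - a, 1.0 - b)   # spacings are uniform over the triangle
        count += in_region(pi, u, k)
    return count / draws

us = [(0.2, 0.3, 0.5), (0.6, 0.3, 0.1)]
cats = [0, 1]
print(in_intersection((0.4, 0.5, 0.1), us, cats))
print(inclusion_frequency((0.2, 0.3, 0.5), 2))
```

Under this assumption a single region covers a fixed point with probability
equal to the corresponding coordinate, so the second printed value should
be close to 0.5.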

As in the special case $n = 1$, the model $(U^n, C^n \times \Pi, \mu^n, B^n)$
does not in itself provide interesting upper and lower probabilities.
Again, conditioning may be illustrated either by fixing $\boldsymbol{\pi}$ and
asking for probability judgments about $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$, or
conversely by fixing $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$ and asking for probability
judgments (i.e., generalized Bayesian inferences) about $\boldsymbol{\pi}$.
Conditioning on fixed $\boldsymbol{\pi}$ leads easily to the expected generalization of
(2.11).  Specifically, if $T$ is the event that $\boldsymbol{\pi}$ has a specified
value, while $\tilde{T}$ is the event that $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$ are fixed,
with $n_i$ observations in category $c_i$ for $i = 1, 2, 3$, then

(2.22)  $P^*(\tilde{T} \mid T) = P_*(\tilde{T} \mid T) = \pi_1^{n_1} \pi_2^{n_2} \pi_3^{n_3}$.

The converse approach of conditioning on $\tilde{T}$ leads to more difficult
mathematics.

Before $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$ are observed, the relevant sample
space $C^n \times \Pi$ consists of $3^n$ triangles, each a copy of $\Pi$.
Conditioning on a set of recorded observations $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$
reduces the relevant sample space to the single triangle associated
with those observations.  Although this triangle is actually a
subset of $C^n \times \Pi$, it is essentially the same as $\Pi$ and will be
formally identified with $\Pi$ for the remainder of this discussion.
Conditioning the model $(U^n, C^n \times \Pi, \mu^n, B^n)$ on $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$
leads therefore to the model $(U^n, \Pi, \mu^n, \tilde{B}^n)$ where $\tilde{B}^n$ is defined
by restricting $B^n$ to the appropriate copy of $\Pi$.  The important
random subset $\tilde{B}^n(u^{(1)}, u^{(2)}, \ldots, u^{(n)})$ of $\Pi$ defined by the random
sample $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$ will be denoted by $V$ for short.  $V$
determines the desired inferences, i.e., the upper and lower
probabilities of a fixed subset $R$ of $\Pi$ are respectively the probability
that $V$ intersects $R$ and the probability that $V$ is contained in
$R$, both conditional on $V$ being nonempty.

$V$ is the intersection of the $n$ random regions $\tilde{B}^{(i)}u^{(i)}$ for
$i = 1, 2, \ldots, n$ where each $\tilde{B}^{(i)}u^{(i)}$ is one of the three types
illustrated on the three levels of Figure 2, the type and level
depending on whether the observation $c^{(i)}$ is $c_1$, $c_2$ or $c_3$.
Figure 4 illustrates one such region for $n = 4$.  It is easily
discovered by experimenting with pictures like Figure 4 that the
shaded region $V$ may have 3, 4, 5 or 6 sides, but most often is
empty.  It is shown in Appendix I that $V$ is nonempty with
probability $n_1!\, n_2!\, n_3! / n!$ under independent uniformly distributed
$u^{(1)}, u^{(2)}, \ldots, u^{(n)}$.  Moreover, conditional on nonempty $V$, six
random vertices of $V$ are shown in Appendix I to have Dirichlet
distributions.  Specifically, define $W^{(i)}$ for $i = 1, 2, 3$ to be the
point $\boldsymbol{\pi}$ in $V$ with maximum coordinate $\pi_i$, and define $Z^{(i)}$
for $i = 1, 2, 3$ to be the point $\boldsymbol{\pi}$ in $V$ with minimum coordinate
$\pi_i$.  These 6 vertices of $V$ need not be distinct, but are distinct
with positive probability and so have different distributions.  Their
distributions are

(2.23)  $W^{(1)}$ :  $D(n_1 + 1, n_2, n_3)$
        $W^{(2)}$ :  $D(n_1, n_2 + 1, n_3)$
        $W^{(3)}$ :  $D(n_1, n_2, n_3 + 1)$
        $Z^{(1)}$ :  $D(n_1, n_2 + 1, n_3 + 1)$
        $Z^{(2)}$ :  $D(n_1 + 1, n_2, n_3 + 1)$
        $Z^{(3)}$ :  $D(n_1 + 1, n_2 + 1, n_3)$

where $D(r_1, r_2, r_3)$ denotes the Dirichlet distribution over the
triangle $\Pi$ whose probability density function is proportional
to $\pi_1^{r_1 - 1} \pi_2^{r_2 - 1} \pi_3^{r_3 - 1}$.  The Dirichlet distribution is defined
as a continuous distribution over $\Pi$ if $r_i > 0$ for $i = 1, 2, 3$.
Various conventions, not listed here, are required to cover the
distributions of the six vertices when some of the $n_i$ are zero.

Figure 4.  The triangle $\Pi$ representing the sample space of unknowns
after $n = 4$ observations $c^{(1)}, c^{(2)}, c^{(3)}, c^{(4)}$ (two in category $c_1$
and one each in $c_2$ and $c_3$) have been taken.  The shaded region is
the realization of $V$ determined by the illustrated realization of
$u^{(1)}, u^{(2)}, u^{(3)}$ and $u^{(4)}$.

Many interesting upper and lower probabilities follow from
the distributions (2.23).  For example, the upper probability that
$\pi_1$ exceeds $\pi_1'$ is the probability that $V$ intersects the region
where $\pi_1 > \pi_1'$, which is in turn the probability that the first
coordinate of $W^{(1)}$ exceeds $\pi_1'$.  In symbols,

(2.24)  $P^*(\pi_1 > \pi_1' \mid n_1, n_2, n_3)$
        $= \displaystyle\int_{\pi_1'}^{1} \int_{0}^{1 - \pi_1} \frac{n!}{n_1!\,(n_2 - 1)!\,(n_3 - 1)!}\, \pi_1^{n_1} \pi_2^{n_2 - 1} \pi_3^{n_3 - 1}\, d\pi_2\, d\pi_1$
        $= \displaystyle\int_{\pi_1'}^{1} \frac{n!}{n_1!\,(n_2 + n_3 - 1)!}\, \pi_1^{n_1} (1 - \pi_1)^{n_2 + n_3 - 1}\, d\pi_1$

if $n_2 > 0$ and $n_3 > 0$.  Similarly, $P_*(\pi_1 > \pi_1' \mid n_1, n_2, n_3)$ is the
probability that the first coordinate of $Z^{(1)}$ exceeds $\pi_1'$, i.e.,

(2.25)  $P_*(\pi_1 > \pi_1' \mid n_1, n_2, n_3)$
        $= \displaystyle\int_{\pi_1'}^{1} \frac{(n + 1)!}{(n_1 - 1)!\,(n_2 + n_3 + 1)!}\, \pi_1^{n_1 - 1} (1 - \pi_1)^{n_2 + n_3 + 1}\, d\pi_1$,

again assuming no prior information about $\boldsymbol{\pi}$.  Two further analogues
of the pair (2.24) and (2.25) may be obtained by permuting the
indices so that the role of 1 is played successively by 2 and 3.
In a hypothetical numerical example with $n_1 = 2$, $n_2 = 1$, $n_3 = 1$ as
used in Figure 4, it is inferred that the probability of at least
half the population belonging in $c_1$ lies between 3/16 and 11/16.  In
passing, note that the upper and lower probabilities (2.24) and
(2.25) are formally identical with Bayes posterior probabilities
corresponding to the pseudoprior distributions $D(1,0,0)$ and $D(0,1,1)$,
respectively.  This appears to be a mathematical accident with a
limited range of applicability, much like the relations between
fiducial and Bayesian results pointed out by Lindley (1958).  In the
present situation, it could be shown that the relations no longer
hold for events of the form $(\pi_1' \le \pi_1 \le \pi_1'')$.
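
The numerical example can be reproduced exactly.  The sketch below is an
illustration rather than the paper's own computation: it uses the standard
identity that, for positive integers $a$ and $b$, the probability that a
Beta$(a, b)$ variable exceeds $x$ equals the probability that a binomial
count with $a + b - 1$ trials and success probability $x$ is at most
$a - 1$.  The first coordinates of $W^{(1)}$ and $Z^{(1)}$ are taken to be
Beta$(n_1 + 1, n_2 + n_3)$ and Beta$(n_1, n_2 + n_3 + 2)$, the marginals
implied by (2.23).  Function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def beta_upper_tail(a, b, x):
    """P(Beta(a, b) > x) for positive integers a and b, computed exactly."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a))

def bounds_pi1_exceeds(x, n1, n2, n3):
    """Upper and lower probabilities that pi_1 exceeds x, per (2.24)-(2.25)."""
    upper = beta_upper_tail(n1 + 1, n2 + n3, x)      # first coordinate of W(1)
    lower = beta_upper_tail(n1, n2 + n3 + 2, x)      # first coordinate of Z(1)
    return upper, lower

upper, lower = bounds_pi1_exceeds(Fraction(1, 2), 2, 1, 1)
print(lower, upper)   # prints: 3/16 11/16
```

Exact rational arithmetic is used so that the result can be compared
directly with the fractions quoted in the text.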

The model $(U^n, C^n \times \Pi, \mu^n, B^n)$ has the illuminating feature
of remaining a product model after conditioning on the sample
observations.  Recall that the original model $(U^n, C^n \times \Pi, \mu^n, B^n)$ is
expressible as the product of the $n$ models $(U^{(i)}, C^n \times \Pi, \mu^{(i)}, B^{(i)})$
for $i = 1, 2, \ldots, n$.  Conditioning the original model on the observations
yields $(U^n, \tilde{T}, \mu^n, \tilde{B}^n)$ where, as above, $\tilde{T}$ is the subset of
$C^n \times \Pi$ with $c^{(1)}, c^{(2)}, \ldots, c^{(n)}$ fixed at their observed values
and

(2.26)  $\tilde{B}^n(u^{(1)}, u^{(2)}, \ldots, u^{(n)}) = B^n(u^{(1)}, u^{(2)}, \ldots, u^{(n)}) \cap \tilde{T}$.

Conditioning the $i$th component model on the $i$th sample
observation yields $(U^{(i)}, \tilde{T}^{(i)}, \mu^{(i)}, \tilde{B}^{(i)})$ where $\tilde{T}^{(i)}$ is the subset
of $C^n \times \Pi$ with $c^{(i)}$ fixed at its observed value, and

(2.27)  $\tilde{B}^{(i)}u^{(i)} = B^{(i)}u^{(i)} \cap \tilde{T}^{(i)}$

for $i = 1, 2, \ldots, n$.  It is clear that

(2.28)  $\tilde{T} = \tilde{T}^{(1)} \cap \tilde{T}^{(2)} \cap \cdots \cap \tilde{T}^{(n)}$,

and from (2.20), (2.26), (2.27) and (2.28) it follows that

(2.29)  $\tilde{B}^n(u^{(1)}, u^{(2)}, \ldots, u^{(n)}) = \tilde{B}^{(1)}u^{(1)} \cap \tilde{B}^{(2)}u^{(2)} \cap \cdots \cap \tilde{B}^{(n)}u^{(n)}$.

From (2.28) and (2.29) it is immediate that the model $(U^n, \tilde{T}, \mu^n, \tilde{B}^n)$
is the product of the $n$ models $(U^{(i)}, \tilde{T}^{(i)}, \mu^{(i)}, \tilde{B}^{(i)})$ for
$i = 1, 2, \ldots, n$.  The meaning of this result is that inferences about
$\boldsymbol{\pi}$ may be calculated by traversing two equivalent routes.  First,
as above, one may multiply the original $n$ models and condition
the product on $\tilde{T}$.  Alternatively, one may condition the original
$n$ models on their associated $\tilde{T}^{(i)}$ and then multiply the conditioned
models.  The availability of the second route is conceptually
interesting, because it shows that the information from the $i$th
sample observation $c^{(i)}$ may be isolated and stored in the form
$(U^{(i)}, \tilde{T}^{(i)}, \mu^{(i)}, \tilde{B}^{(i)})$, and when the time comes to assemble all
the information one need only pick up the pieces and multiply them.
This basic result clearly holds for a wide class of choices of $U$ and
$B$, not just the particular trinomial sampling model illustrated here.

The separability of sample information suggests that prior
information about $\boldsymbol{\pi}$ should also be stored as a model of the general type
$(X, \Pi, \mu, \Gamma)$ and should be combined with sample information according
to the product rule.  Such prior information could be regarded as the
distillation of previous empirical data.  This proposal brings out
the full dimensions of the generalized Bayesian inference scheme.
Not only does the product rule show how to combine individual pieces
of sample information, but it handles the incorporation of prior
information as well.  Moreover, the sample information and the prior
information are handled symmetrically by the product rule, thus
banishing the asymmetric appearance of standard Bayesian inference.
At the same time, if the prior information is given in the standard
form of an ordinary probability distribution, the methods of generalized
Bayesian inference reproduce exactly the standard Bayesian inferences.
A proof of the last assertion will now be sketched in the
context of trinomial sampling.  An ordinary prior distribution for
an unknown $\boldsymbol{\pi}$ is represented by a model of the form $(X, \Pi, \mu, \Gamma)$
where $\Gamma$ is single-valued and hence no ambiguity is allowed in the
computed probabilities.  Without loss of generality, the model
$(X, \Pi, \mu, \Gamma)$ may be specialized to $(\Pi, \Pi, \mu, I)$ where $I$ is the
identity mapping and $\mu$ is the ordinary prior distribution over $\Pi$.
For simplicity, assume that $\mu$ is a discrete distribution with
probabilities $p_1, p_2, \ldots, p_d$ assigned to points $\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_d$ in
$\Pi$.  From (2.16) it follows that the mapping associated with a product
of models is single-valued if the mapping associated with any
component model is single-valued.  If a component model not only has a
single-valued mapping, but has a discrete measure $\mu$ as well, then
the product model is easily seen to reduce to another discrete
distribution over the same carriers $\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_d$.  Indeed the
second line of (2.16) shows that the product model assigns
probabilities $P(\boldsymbol{\pi}_i)$ to $\boldsymbol{\pi}_i$ which are proportional to $p_i l_i$, where $l_i$
is the probability that the random region $V$ includes the point
$\boldsymbol{\pi}_i$.  Setting $\boldsymbol{\pi}_i = (\pi_{i1}, \pi_{i2}, \pi_{i3})$, it follows from the properties
of the random region $V$ that

(2.30)  $l_i = \pi_{i1}^{n_1} \pi_{i2}^{n_2} \pi_{i3}^{n_3}$,

which is just the probability that all of the independent random
regions whose intersection is $V$ include $\boldsymbol{\pi}_i$.  Normalizing the
product model as indicated in (2.1) or (2.2) leads finally to

(2.31)  $P(\boldsymbol{\pi}_i) = \dfrac{p_i l_i}{p_1 l_1 + p_2 l_2 + \cdots + p_d l_d}$

for $i = 1, 2, \ldots, d$, which is the standard form of Bayes's theorem.
This result holds for any choices of $U$ and $B$ satisfying (2.9).  Note
that $l_i$ is identical with the likelihood of $\boldsymbol{\pi}_i$.
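
The reduction to Bayes's theorem in (2.30) and (2.31) amounts to a short
calculation.  The sketch below is illustrative only: the two prior points
and their probabilities are made-up values, and `generalized_posterior` is
a hypothetical name.

```python
from fractions import Fraction as F

def generalized_posterior(prior, counts):
    """Posterior over discrete prior points via (2.30) and (2.31).

    prior  : list of (p_i, (pi_i1, pi_i2, pi_i3)) pairs
    counts : (n1, n2, n3), the numbers of observations in each category
    """
    weights = []
    for p, point in prior:
        likelihood = 1
        for coordinate, n in zip(point, counts):
            likelihood *= coordinate ** n          # (2.30)
        weights.append(p * likelihood)
    total = sum(weights)
    return [w / total for w in weights]            # (2.31)

# Hypothetical two-point prior with equal probabilities.
prior = [
    (F(1, 2), (F(1, 2), F(1, 4), F(1, 4))),
    (F(1, 2), (F(1, 4), F(1, 2), F(1, 4))),
]
print(generalized_posterior(prior, (2, 1, 1)))   # posterior probabilities 2/3 and 1/3
```

With two observations in the first category and one in each of the others,
the first point has likelihood 1/64 and the second 1/128, which gives the
posterior probabilities 2/3 and 1/3.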

Generalized Bayesian inference permits the use of sample
information alone, which is mathematically equivalent to adopting the
informationless prior model in which all upper probabilities are
unity and all lower probabilities are zero.  At another extreme, it
permits the incorporation of a familiar Bayesian prior distribution
(if it is a genuine distribution) and then yields the familiar
Bayesian inferences.  Between these extremes a wide range of
flexibility exists.  For example, a prior distribution could be introduced
for the coordinate $\pi_1$ alone, while making no prior judgment about
the ratio $\pi_2 / \pi_3$.  Alternatively, one could specify prior information
to be the same as that contained in a sample of size $m$ which produced
$m_i$ observations in category $c_i$ for $i = 1, 2, 3$.  In the analysis of
quite small samples, it would be reasonable to attempt to find some
characterization of prior information which could reflect tolerably
well public notions about $\boldsymbol{\pi}$.  In large samples, the inferences clearly
resemble Bayesian inferences and are insensitive to prior information
over a wide range.
3.  A SECOND ILLUSTRATION

Consider a sequence of independent Bernoulli trials represented
by $z_i$ with

(3.1)  $P(z_i = 1 \mid p_i) = p_i$ and $P(z_i = 0 \mid p_i) = 1 - p_i$,
       for $i = 1, 2, \ldots, n$,

where it is suspected that the sequence $p_i$ is subject to a monotone
upward drift.  In this situation, the common approach to a sequence
of observations $z_i$ is to apply a test of the null hypothesis
$\{p_1 = p_2 = \cdots = p_n\}$ designed to be sensitive against the alternative
hypothesis $\{p_1 \le p_2 \le \cdots \le p_n\}$.  The unorthodox approach suggested
here is to compute upper and lower probability inferences for the
pair of symmetric hypotheses $\{p_1 \ge p_2 \ge \cdots \ge p_n\}$ and
$\{p_1 \le p_2 \le \cdots \le p_n\}$
under the overall prior assumption that the sequence $p_i$ is monotone,
either increasing or decreasing, with probability one.  A small
upper probability for either of these hypotheses would be evidence
for drift in the direction contrary to that indicated by the hypothesis.
Upper and lower probabilities may also be computed for the null
hypothesis $\{p_1 = p_2 = \cdots = p_n\}$, but the upper probability will
usually be vanishingly small in sample sequences of moderate length,
however little trend is apparent, while the lower probability is
always zero.

The model described could apply in simple bioassays or learning
situations.  A wider range of applications could be achieved in
several ways, for example by allowing several observations at each
$p_i$ or postulating Markov type dependence in the $z_i$ sequence.  But the
aim here is to focus attention as simply as possible on one feature of
the new methods, namely their ability to handle the problem of many
nuisance parameters which plagues the more traditional forms of
statistical inference.  Plausible inferences may be obtained
despite the presence of as many continuous parameters as there are
dichotomous observables.

Under the binomial analogue of the trinomial model treated in
Section 2, a single binomial observable $z$ is represented before
observation by the model $(U, Z \times P, \mu, B)$ where

(3.1)  $U = \{u ;\ 0 \le u \le 1\}$,

(3.2)  $Z = \{z ;\ z = 0 \text{ or } z = 1\}$,

(3.3)  $P = \{p ;\ 0 \le p \le 1\}$,

$\mu$ is the uniform distribution over $U$, and

(3.4)  $Bu = \{(z, p) ;\ z = 1 \text{ and } u \le p \le 1, \text{ or } z = 0 \text{ and } 0 \le p \le u\}$.

After conditioning on $z$, this model becomes effectively
$(U, P, \mu, B_z)$ where

(3.5)  $B_z u = \{p ;\ u \le p \le 1\}$  if $z = 1$,
       $= \{p ;\ 0 \le p \le u\}$  if $z = 0$.

A conditioned model of this kind may be constructed for each of
$n$ independent observations $z_i$ and associated parameters $p_i$.
Combining these $n$ sources of information about $p_1, p_2, \ldots, p_n$
yields a single model $(U^n, P^n, \mu^n, B_{(z_1, z_2, \ldots, z_n)})$ where

(3.6)  $U^n = \{(u_1, u_2, \ldots, u_n) ;\ 0 \le u_i \le 1 \text{ for } i = 1, 2, \ldots, n\}$,

(3.7)  $P^n = \{(p_1, p_2, \ldots, p_n) ;\ 0 \le p_i \le 1 \text{ for } i = 1, 2, \ldots, n\}$,

$\mu^n$ is the uniform distribution over the cube $U^n$, and

(3.8)  $B_{(z_1, z_2, \ldots, z_n)}(u_1, u_2, \ldots, u_n) = \{(p_1, p_2, \ldots, p_n) ;\ p_i \in B_{z_i}u_i \text{ for } i = 1, 2, \ldots, n\}$.

The combined model would be appropriate for unrestricted inferences
about an unknown $(p_1, p_2, \ldots, p_n)$ based on observations $(z_1, z_2, \ldots, z_n)$.
However, when consideration is restricted to the subset $S$ of $P^n$
in which $p_1, p_2, \ldots, p_n$ is a monotone sequence, the sharpness of the
inferences is much improved.

Define $T_1$ and $T_2$ to be the subsets of $S$ for which
$p_1 \le p_2 \le \cdots \le p_n$ and $p_1 \ge p_2 \ge \cdots \ge p_n$, respectively.  Define
$T_{12} = T_1 \cap T_2$ to be the subset of $S$ for which $p_1 = p_2 = \cdots = p_n$.
An immediate objective is to characterize $T_1^*$, $T_2^*$ and $T_{12}^*$, from
whose $\mu^n$ measure the desired inferences will follow.  For example,
$T_1^*$ consists of all points $(u_1, u_2, \ldots, u_n)$ for which there exists
some $(p_1, p_2, \ldots, p_n)$ satisfying $p_1 \le p_2 \le \cdots \le p_n$ and such that
$p_i$ lies in $B_{z_i}u_i$ for $i = 1, 2, \ldots, n$.  With the help of Figure 5
it is easily checked that

(3.9)  $T_1^* = \{(u_1, u_2, \ldots, u_n) ;\ u_i \le u_j\ \ \forall\, (i, j) \ni z_i = 1,\ z_j = 0,\ i < j\}$.

By symmetry,

(3.10)  $T_2^* = \{(u_1, u_2, \ldots, u_n) ;\ u_i \ge u_j\ \ \forall\, (i, j) \ni z_i = 0,\ z_j = 1,\ i < j\}$.

Finally,

(3.11)  $T_{12}^* = \{(u_1, u_2, \ldots, u_n) ;\ u_i \le u_j\ \ \forall\, (i, j) \ni z_i = 1,\ z_j = 0\}$.

It is clear that $S^* = T_1^* \cup T_2^*$ and that $T_{12}^*$, $T_1^* - T_{12}^*$ and
$T_2^* - T_{12}^*$ are disjoint sets whose union is $S^*$.

FIGURE 5.  The plotted values $p_1, p_2, \ldots, p_n$ determine a point of
$P^n$ for which $p_1 \le p_2 \le \cdots \le p_n$.  The plotted values $u_1, u_2, \ldots, u_n$
determine a point of $U^n$ for which $p_1$ lies in $B_1 u_1$, $p_2$ lies in
$B_0 u_2$, $p_3$ lies in $B_1 u_3$, ..., $p_n$ lies in $B_0 u_n$.  The interpretation
is that $(u_1, u_2, \ldots, u_n)$ lies in the region $T_1^*$ determined by the
observation $z_1 = 1$, $z_2 = 0$, $z_3 = 1$, ..., $z_n = 0$.
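
The order condition (3.9) can be checked against a direct construction of
a nondecreasing sequence.  The sketch below is illustrative and rests on a
stated assumption: an observation $z_i = 1$ confines $p_i$ to the interval
from $u_i$ to 1, and $z_i = 0$ confines it to the interval from 0 to $u_i$,
which is the convention under which $P(z = 1 \mid p) = p$ for uniform $u$
and under which (3.9) holds.  Function names are hypothetical.

```python
def in_T1_star(u, z):
    """Order condition (3.9): every earlier one sits below every later zero."""
    n = len(u)
    return all(
        u[i] <= u[j]
        for i in range(n)
        for j in range(i + 1, n)
        if z[i] == 1 and z[j] == 0
    )

def increasing_witness(u, z):
    """Return the smallest nondecreasing p compatible with (u, z), or None."""
    p, floor = [], 0.0
    for ui, zi in zip(u, z):
        if zi == 1:
            floor = max(floor, ui)   # z = 1 requires p_i >= u_i
        elif floor > ui:             # z = 0 requires p_i <= u_i
            return None
        p.append(floor)
    return p

z = (0, 0, 1, 1, 0, 1, 1)
u_ok = (0.9, 0.1, 0.4, 0.6, 0.7, 0.2, 0.3)
u_bad = (0.9, 0.1, 0.4, 0.6, 0.5, 0.2, 0.3)
print(in_T1_star(u_ok, z), increasing_witness(u_ok, z))
print(in_T1_star(u_bad, z), increasing_witness(u_bad, z))
```

A witness exists exactly when the order condition holds, because the
running maximum of the $u_i$ attached to ones is the least value any
nondecreasing sequence can take at each position.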

$U^n$ may be decomposed into $n!$ geometrically similar simplexes,
each characterized by a particular ordering of the values of the
coordinates $(u_1, u_2, \ldots, u_n)$.  These simplexes are in one-to-one
correspondence with the permutations $(1, 2, \ldots, n) \to (1^*, 2^*, \ldots, n^*)$,
where for every $(u_1, u_2, \ldots, u_n)$ in a given simplex the
corresponding permutation obeys $u_{1^*} \le u_{2^*} \le \cdots \le u_{n^*}$.  Since the
characterizations (3.9), (3.10) and (3.11) involve only order
relations among coordinates $u_i$, each of the simplexes is either
included or excluded as a unit from $T_1^*$ or $T_2^*$ or $T_{12}^*$.  And since
each of the $n!$ simplexes has $\mu^n$ measure $1/n!$, the $\mu^n$
measures of $T_1^*$ or $T_2^*$ or $T_{12}^*$ may be found by counting the
appropriate number of simplexes and dividing by $n!$.  Or, instead of
counting simplexes, one may count the permutations to which they
correspond.  The permutation $(1, 2, \ldots, n) \to (1^*, 2^*, \ldots, n^*)$ carries
the observed sequence $(z_1, z_2, \ldots, z_n)$ of zeros and ones into
another sequence $(z_{1^*}, z_{2^*}, \ldots, z_{n^*})$ of zeros and ones.  According
to the definition of $T_1^*$, a simplex is contained in $T_1^*$ if and
only if its corresponding permutation has the property that
$i^* < j^*$ for all $i < j$ such that $z_i = 1$ and $z_j = 0$, i.e., any
pair ordered (1,0) extracted from $(z_1, z_2, \ldots, z_n)$ must retain the
same order in the permuted sequence $(z_{1^*}, z_{2^*}, \ldots, z_{n^*})$.  Similarly,
to satisfy $T_2^*$ any pair ordered (0,1) extracted from $(z_1, z_2, \ldots, z_n)$
must have its order reversed in the permuted sequence, while to
satisfy $T_1^* \cap T_2^*$ the sequence $(z_{1^*}, z_{2^*}, \ldots, z_{n^*})$ must consist
of all ones followed by all zeros.

If $(z_1, z_2, \ldots, z_n)$ contains $n_1$ ones and $n_0$ zeros, then a
simple counting of permutations yields

(3.12)  $\mu^n(T_{12}^*) = \dfrac{n_1!\, n_0!}{n!}$.

A simple iterative procedure for computing $\mu^n(T_1^*)$ or $\mu^n(T_2^*)$ is
derived in Appendix II by Herbert Weisberg.  The result is quoted
below and illustrated on a numerical example.

For a given sequence of observations $z_1, z_2, \ldots$ of indefinite
length define $N(n)$ to be the number of permutations of the
restricted type counted in $T_1^*$.  $N(n)$ may be decomposed into

(3.13)  $N(n) = \displaystyle\sum_{k=0}^{r} N(k, n)$

where $N(k, n)$ counts the subset of permutations such that
$(z_{1^*}, z_{2^*}, \ldots, z_{n^*})$ has $k$ zeros preceding the rightmost one.
Since no zero which follows the rightmost one in the original
sequence $(z_1, z_2, \ldots, z_n)$ can be permuted to the left of any one
under any allowable permutation, the upper limit $r$ in (3.13) may
be taken as the number of zeros preceding the rightmost one in the
original sequence $(z_1, z_2, \ldots, z_n)$.  In the special case of a
sequence consisting entirely of zeros, all of the zeros will be
assumed to follow the rightmost one so that $N(k, n) = 0$ for $k > 0$
and indeed $N(n) = N(0, n) = n!$.  Weisberg's iterative formula is

(3.14)  $N(k, n+1) = \displaystyle\sum_{j=0}^{k-1} N(j, n) + (n_1 + 1 + k)\, N(k, n)$  if $z_{n+1} = 1$,
        $= (n_0 + 1 - k)\, N(k, n)$  if $z_{n+1} = 0$,

where $n_1$ and $n_0$ denote as above the numbers of ones and zeros,
respectively, in $(z_1, z_2, \ldots, z_n)$.

Formula (3.14) has the pleasant feature that the counts for
the sequences $(z_1)$, $(z_1, z_2)$, $(z_1, z_2, z_3)$, ... may be built up
successively, and further observations may be easily incorporated
as they arrive.  Consider, for example, the hypothetical observations

$(z_1, z_2, \ldots, z_7) = (0, 0, 1, 1, 0, 1, 1)$.

The following tabular scheme shows
$z_n$, $N(0,n)$, ..., $N(r,n)$
on line $n$, for $n = 1, 2, \ldots, 7$.

  n   z_n   N(0,n)  N(1,n)  N(2,n)  N(3,n)
  1    0       1
  2    0       2
  3    1       2       2       2
  4    1       4       8      12
  5    0      12      16      12
  6    1      36      76      88      40
  7    1     144     416     640     480

from which $N(7) = 1680$.  The number of permutations consistent with
$T_2^*$ is found by applying the same iterative process to the sequence
(1, 1, 0, 0, 1, 0, 0) with zeros and ones interchanged.  This yields

  n   z_n   N(0,n)  N(1,n)  N(2,n)
  1    1       1
  2    1       2
  3    0       2
  4    0       4
  5    1      12       4       4
  6    0      36       8       4
  7    0     144      24       8

from which $N(7) = 176$.  The number of permutations common to $T_1^*$
and $T_2^*$ is $3!\,4! = 144$.  Thus $\mu^n(T_1^*) = 1680/7!$, $\mu^n(T_2^*) = 176/7!$,
$\mu^n(T_{12}^*) = 144/7!$ and $\mu^n(S^*) = (1680 + 176 - 144)/7! = 1712/7!$.
Consequently, the upper and lower probabilities of $T_1$, $T_2$ and $T_{12}$
conditional on $S$ and $(z_1, z_2, \ldots, z_7) = (0, 0, 1, 1, 0, 1, 1)$ are

$P^*(T_1) = \dfrac{1680}{1712}$,    $P_*(T_1) = \dfrac{1536}{1712}$,

$P^*(T_2) = \dfrac{176}{1712}$,    $P_*(T_2) = \dfrac{32}{1712}$,

$P^*(T_{12}) = \dfrac{144}{1712}$,    $P_*(T_{12}) = 0$.

Since more than 10 percent of the measure could apply to a monotone
nonincreasing sequence, the evidence for an increasing sequence is
not compelling.
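
The iteration (3.14) is short enough to sketch in code.  The following is
an illustration, not Weisberg's own program: it takes (3.14) in the form
"sum of $N(j,n)$ for $j < k$, plus $(n_1 + 1 + k) N(k,n)$" when the new
observation is a one and "$(n_0 + 1 - k) N(k,n)$" when it is a zero,
starts from a single count of 1 for the empty sequence (an assumption
made for convenience), and cross-checks the total by enumerating all
orderings.  Function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def weisberg_counts(z):
    """Return [N(0,n), N(1,n), ...] for the sequence z, built up by (3.14)."""
    counts = [1]            # empty sequence: one (trivial) ordering
    ones = zeros = 0
    for obs in z:
        if obs == 1:
            new, running = [], 0
            for k in range(zeros + 1):
                nk = counts[k] if k < len(counts) else 0
                new.append(running + (ones + 1 + k) * nk)
                running += nk
            ones += 1
        else:
            new = [(zeros + 1 - k) * nk for k, nk in enumerate(counts)]
            zeros += 1
        counts = new
    return counts

def brute_force_count(z):
    """Count rank assignments in which each earlier one ranks below each later zero."""
    n = len(z)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if z[i] == 1 and z[j] == 0]
    return sum(all(rank[i] < rank[j] for i, j in pairs)
               for rank in permutations(range(n)))

def monotone_bounds(z):
    """(upper, lower) probabilities of increasing, decreasing and constant p."""
    n1 = sum(z)
    n0 = len(z) - n1
    t1 = sum(weisberg_counts(z))
    t2 = sum(weisberg_counts([1 - zi for zi in z]))
    t12 = factorial(n1) * factorial(n0)
    s = t1 + t2 - t12
    return {
        "T1": (Fraction(t1, s), Fraction(t1 - t12, s)),
        "T2": (Fraction(t2, s), Fraction(t2 - t12, s)),
        "T12": (Fraction(t12, s), Fraction(0)),
    }

z = [0, 0, 1, 1, 0, 1, 1]
print(weisberg_counts(z), sum(weisberg_counts(z)))
print(monotone_bounds(z))
```

Because each step uses only the previous line of counts, a new observation
can be absorbed by one more pass through the loop.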

For the extended sequence of observations 0, 0, 1, 1, 0,
1, 1, 0, 1, 1, 1, ..., the lower and upper probabilities of a monotone
downward sequence after $n$ observations are as follows:

   n     $P_*(T_2)$     $P^*(T_2)$
   1     0.           1.
   2     0.           1.
   3     0.           .333
   4     0.           .167
   5     .167         .417
   6     .048         .190
   7     .019         .103
   8     .188         .319
   9     .065         .148
  10     .028         .080
  11     .014         .047

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4.  COMMENTS  ON  THE  METHOD  OF  GENERATING 
UPPER  AND  LOWER  PROBABILITIES 

Although often notationally convenient, it is unnecessary to
use models $(X, S, \mu, \Gamma)$ outside of the subclass where the
inverse of $\Gamma$ is single-valued. For the model
$(X, \tilde{S}, \mu, \tilde{\Gamma})$ with

(4.1)  $\tilde{S} = X \times S$
and

(4.2)  $\tilde{\Gamma} x = \{x\} \times \Gamma x$

does belong to the stated subclass, and yields

(4.3)  $(P^*(T), P_*(T)) = (\tilde{P}^*(X \times T), \tilde{P}_*(X \times T))$

for any $T \subset S$, where the left side of (4.3) refers to any
original model $(X, S, \mu, \Gamma)$ and the right side refers to the
corresponding model $(X, \tilde{S}, \mu, \tilde{\Gamma})$. Moreover, the
model $(X, \tilde{S}, \mu, \tilde{\Gamma})$ provides upper and lower
probabilities for all subsets of $X \times S$, not just those of the
form $X \times T$. On the other hand, it was assumed in applying the
original form $(X, S, \mu, \Gamma)$ that the outcome $x$ in $X$ is
conceptually unobservable, so that no operational loss is incurred by
the restriction to subsets of the form $X \times T \subset \tilde{S}$.
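
A small finite example may make (4.1) to (4.3) concrete. The sketch below assumes the usual definitions for a finite model: $P^*(T)$ is the normalized $\mu$-measure of those $x$ whose image $\Gamma x$ meets $T$, and $P_*(T)$ that of those $x$ whose nonempty image lies inside $T$. The particular `mu`, `gamma` and point labels are invented for illustration.

```python
from fractions import Fraction
from itertools import chain, combinations


def bounds(mu, gamma, T):
    """Return (P^*(T), P_*(T)) for a finite model (X, S, mu, Gamma)."""
    T = frozenset(T)
    norm = sum(m for x, m in mu.items() if gamma[x])
    upper = sum(m for x, m in mu.items() if gamma[x] & T)
    lower = sum(m for x, m in mu.items() if gamma[x] and gamma[x] <= T)
    return Fraction(upper) / norm, Fraction(lower) / norm


# Hypothetical model in which the inverse of Gamma is NOT single-valued:
# the point "a" belongs to both Gamma(x1) and Gamma(x2).
mu = {"x1": Fraction(1, 2), "x2": Fraction(1, 3), "x3": Fraction(1, 6)}
S = ("a", "b", "c")
gamma = {"x1": frozenset("a"), "x2": frozenset("ab"), "x3": frozenset("bc")}

# Lifted model of (4.1) and (4.2): S~ = X x S and Gamma~ x = {x} x Gamma x.
gamma_lift = {x: frozenset((x, s) for s in gamma[x]) for x in mu}


def lift(T):
    """The subset X x T of the lifted space."""
    return frozenset((x, s) for x in mu for s in T)


# Every lifted point lies in exactly one image, so the inverse is single-valued.
points = [p for x in mu for p in gamma_lift[x]]
single_valued = len(points) == len(set(points))

subsets = list(chain.from_iterable(combinations(S, r) for r in range(len(S) + 1)))
agree = all(bounds(mu, gamma, T) == bounds(mu, gamma_lift, lift(T)) for T in subsets)
print(single_valued, agree)
```

Exact fractions are used so that the two sides of (4.3) can be compared with `==` rather than with a tolerance.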

Underlying the formalism of $(X, S, \mu, \Gamma)$ or its equivalent
$(X, \tilde{S}, \mu, \tilde{\Gamma})$ is the idea of a probability model
which assigns a distribution only over a partition of a complete
sample space, specifically the distribution $\mu$ over the partition of
$\tilde{S} = X \times S$ defined by $X$. Thus the global probability law
of an ordinary probability measure space is replaced by a marginal
distribution or what might be called a partial probability law. The
aim therefore is to establish a useful probability calculus on
marginal or partial assumptions.

I believe that the most serious challenges to applications of
the new calculus will come not from criticism of the logic but from
the strong form of ignorance which is necessarily built into
less-than-global probability laws. To illustrate, consider a simple
example where $w_1$ denotes a measured weight, $w_2$ denotes a true
weight, and $x = w_1 - w_2$ denotes a measurement error. Assume that
ample relevant experience is available to justify assigning a
specific error distribution $\mu$ over the space $X$ of possible values
of $x$. The situation may be represented by the model
$(X, W, \mu, \Gamma)$ with $X$ and $\mu$ as defined, with
$W = \{(w_1, w_2);\ w_1 > 0,\ w_2 > 0\}$, and $\Gamma$ defined by the
relation $x = w_1 - w_2$. Conditioning the model on an observed $w_1$
leaves one with the same measure $\mu$ applied to $w_1 - w_2$, except
for renormalization which restricts the measure to $w_2 > 0$. The
result is very much in the spirit of the fiducial argument (although
there is some doubt about Fisher's attitude to renormalization). I am
unable to fault the logic of this fiducial-like argument. Rather any
discomfort is produced by distrust of the initial model, in
particular by its implication that every uncertain event governed by
the true weight $w_2$ has initial upper and lower probabilities one
and zero. It would be hard to escape a feeling in most real
situations that a good bit of information about a parameter is
available, even if difficult to formalize objectively, and that such
information should clearly alter the fiducial-like inference if it
could be incorporated. One way to treat this weakness is openly to
eschew the use of prior information, while not necessarily denying
its existence, that is, to assert that the statistician should
summarize only that information which relies on the observation and
the objectively based error distribution $\mu$.
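
The conditioning step in the weight example can be sketched numerically. The text leaves the error distribution unspecified; the code below assumes, purely for illustration, a normal error with mean zero and standard deviation `sigma`. After conditioning on $w_1$ the image of each $x$ is the single point $w_2 = w_1 - x$, so upper and lower probabilities coincide and a single number results.

```python
import math


def normal_cdf(t, sigma):
    """Distribution function of a N(0, sigma^2) error (an assumed form)."""
    return 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))


def true_weight_cdf(t, w1, sigma):
    """Pr(w2 <= t | w1) with x = w1 - w2, renormalized to w2 > 0."""
    if t <= 0.0:
        return 0.0
    # w2 <= t  <=>  x >= w1 - t,   and   w2 > 0  <=>  x < w1
    kept = normal_cdf(w1, sigma)                      # mass with w2 > 0
    hit = normal_cdf(w1, sigma) - normal_cdf(w1 - t, sigma)
    return hit / kept


w1, sigma = 10.0, 1.0
for t in (8.0, 9.0, 10.0, 11.0, 12.0):
    print(t, round(true_weight_cdf(t, w1, sigma), 4))
```

When $w_1$ is many standard deviations above zero the renormalization is negligible; it matters only for observations near the boundary.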

Because  of  the  conservatism  implicit  in  the  definition  of  upper 
and  lower  probabilities,  the  approach  of  rejecting  soft  information 
seems  likely  to  provide  conservative  inferences  on  an  average,  but 
I  have  not  proved  theorems  to  this  effect.  The  difficulty  is  that 
the  rejection  of  all  soft  information,  including  even  information 
about  parametric  forms,  may  lead  to  unrealistically  weak  inferences. 

The alternative approach is to promote vague information into as
precise a model as one dares, and combine it in the usual way with
sample information.

Some comments on the mathematics of upper and lower probabilities
are appropriate. A very general scheme for assigning upper and lower
probabilities to the subsets of a sample space $S$ is to define a
family $\mathcal{C}$ of measures $P$ over $S$ and to set

(4.4)  $P^*(T) = \sup_{P \in \mathcal{C}} P(T)$  and  $P_*(T) = \inf_{P \in \mathcal{C}} P(T)$.

Within the class of systems of upper and lower probabilities achieved
in this way for different $\mathcal{C}$, there is a hierarchical scheme
of shrinking subclasses ending with the class of systems defined by
models like $(X, S, \mu, \Gamma)$. See Dempster (1967a). The family
$\mathcal{C}$ corresponding to a given $(X, S, \mu, \Gamma)$ consists of
all measures $P$ which for each $x$ distribute the probability element
$d\mu(x)$ in some way over $\Gamma x$. Some readers may feel that all
systems should be allowed, not just the subclass of this paper. In
doing so, however, one loses the conception of a source of
information as being a single probability measure. For, in the
unrestricted formulation of (4.4) the class $\mathcal{C}$ consists of
conceptually distinct measures such as might be adopted by a
corresponding class of personalist statisticians, and the
conservatism in the bounds of (4.4) amounts to an attempt to please
both extremes in the class of personalist statisticians. I believe
that the symmetry arguments underlying probability assignments do not
often suggest hypothetical families $\mathcal{C}$ demanding
simultaneous satisfaction. Also, the rules of conditioning and, more
generally, of combination of independent sources of information do
not extend to the unrestricted system (4.4), and without these rules
the spirit of the present approach is lost.
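
For a finite model the family $\mathcal{C}$ attached to $(X, S, \mu, \Gamma)$ can be explored directly. The sketch below assumes every $\Gamma x$ is nonempty and enumerates only the extreme members of the family, those that place all of $\mu(x)$ on a single point of $\Gamma x$; since $P(T)$ is linear in how each $\mu(x)$ is spread, the sup and inf in (4.4) are attained at such members. The model and all names are invented for illustration.

```python
from fractions import Fraction
from itertools import chain, combinations, product

mu = {"x1": Fraction(1, 2), "x2": Fraction(1, 3), "x3": Fraction(1, 6)}
S = ("a", "b", "c")
gamma = {"x1": frozenset("a"), "x2": frozenset("ab"), "x3": frozenset("bc")}
xs = list(mu)


def envelope(T):
    """(sup, inf) of P(T) over measures that move each mu(x) onto Gamma x."""
    T = set(T)
    values = []
    for choice in product(*(sorted(gamma[x]) for x in xs)):
        # this member of the family puts all of mu(x) on the chosen point
        values.append(sum(mu[x] for x, s in zip(xs, choice) if s in T))
    return max(values), min(values)


def direct(T):
    """(P^*(T), P_*(T)) computed from the multivalued mapping itself."""
    T = frozenset(T)
    upper = sum(mu[x] for x in xs if gamma[x] & T)
    lower = sum(mu[x] for x in xs if gamma[x] <= T)
    return upper, lower


subsets = list(chain.from_iterable(combinations(S, r) for r in range(len(S) + 1)))
matches = all(envelope(T) == direct(T) for T in subsets)
print(matches)
```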


The aim of this short section has been to suggest that upper
and lower probabilities generated by multivalued mappings provide
a flexible means of characterizing limited amounts of information.
They do not solve the difficult problems of what information should
be used, and of what model appropriately represents that information.
They do not provide the only way to discuss meaningful upper and
lower probabilities. But they do provide an approach with a
well-rounded logical structure which applies naturally in the
statistical context of drawing inferences from samples to populations.

5.  COMMENTS  ON  THE  MODELS  USED  FOR  INFERENCE

The  models  used  here  for  the  representation  of  sampling  from 
a  population  take  as  their  point  of  departure  a  space  whose 
elements  correspond  to  the  members  of  the  population.  In  addition 
to  the  complex  of  observable  characteristics  usually  postulated 
in  mathematical  statistics,  each  population  member  is  given  an 
individual  identity.  In  conventional  mathematical  statistics 
the  term  hypothesis  is  often  used  for  an  unknown  population 
distribution  of  observable  characteristics,  but  the  presence  of 
the  population  space  in  the  model  leads  directly  to  the  more 
fundamental  question  of  how  each  hypothesized  population  distri¬ 
bution  applies  to  the  elements  of  the  population  space,  i.e.,  under 
a  given  hypothesis  what  are  the  observable  characteristics  of  each 
population member? In the trinomial sampling model of Section 2,
the question is answered by the multivalued mapping $B$ defined in
(2.7). As illustrated in Figure 1, $B$ asserts that for each
hypothesis $\pi$ the population space $U$ partitions into three regions
$U_1$, $U_2$, $U_3$ corresponding to the observable characteristics
$c_1$, $c_2$, $c_3$. More generally, the observable characteristics may
be multinomial with $k$ categories $c_1, c_2, \ldots, c_k$ and the
population space $U$ may be any space with an associated random
sampling measure $\mu$. For a given hypothesis
$\pi = (\pi_1, \pi_2, \ldots, \pi_k)$ the question is answered by
determining subsets $U_1, U_2, \ldots, U_k$ of $U$ which specify that
a population member in $U_i$ is permitted to have characteristic
$c_i$ under $\pi$, for $i = 1, 2, \ldots, k$. Having reached this point
in building the model, it seems reasonable to pose the restriction
which generalizes (2.9), namely,

(5.1)  $\mu(U_i) = \pi_i$  and  $\mu(U_i \cap U_j) = 0$

for $i, j = 1, 2, \ldots, k$ and $i \ne j$. The reason for (5.1) as with
(2.9) is simply to have $\pi_i$ represent both upper and lower
probabilities of $c_i$ for a single drawing with a given $\pi$.
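
Restriction (5.1) can be checked by simulation for the trinomial model. The sketch below assumes the rule read off from Appendix I: with $U$ the triangle of points $u = (u_1, u_2, u_3)$ under the uniform measure, a member $u$ has characteristic $c_i$ under $\pi$ when $u_i/\pi_i$ is the smallest of the three ratios. The function names, the seed and the hypothesis `pi` are illustrative choices.

```python
import random


def sample_simplex(rng, k=3):
    """Draw a point uniformly from the simplex via normalized exponentials."""
    e = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(e)
    return [v / total for v in e]


def cell_index(u, pi):
    """Index i of the cell U_i containing u: the i minimizing u_i / pi_i."""
    ratios = [ui / p for ui, p in zip(u, pi)]
    return ratios.index(min(ratios))


rng = random.Random(0)
pi = (0.5, 0.3, 0.2)
trials = 100_000
counts = [0, 0, 0]
for _ in range(trials):
    counts[cell_index(sample_simplex(rng), pi)] += 1
freq = [c / trials for c in counts]
print(freq)  # each entry should lie close to the matching pi_i
```

The overlaps $U_i \cap U_j$ are the boundaries where two ratios tie, which have measure zero under the uniform measure, in line with the second half of (5.1).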

Now it is evident that the above formulation by no means
uniquely determines a model for multinomial sampling. Indeed,
one may start from any continuous space $U$ with measure $\mu$, and
for each $\pi$ specify a partition $U_1, U_2, \ldots, U_k$ satisfying
(5.1) but otherwise completely arbitrary. In other words there is a
huge bundle of available models. In Dempster (1966), two choices
were offered which I called models of the first kind and models of
the second kind. The former assumes that the multinomial categories
$c_1, c_2, \ldots, c_k$ have a meaningful order, and is uniquely
determined by the assumption that the population members have an
order consistent with the order of their observable characteristics
under any hypothesis $\pi$. See Dempster (1967b). The restriction to
ordered categories implies essentially a univariate characteristic,
and because that restriction is so severe the following discussion is
mostly aimed at a general multinomial situation with no mathematical
structure assumed on the space of $k$ categories. The general model
of the second kind is defined by extending (2.5), (2.6) and (2.7)
in the obvious way from $k = 3$ to general $k$. This model treats
the $k$ categories with complete symmetry, but it is not the only
model to do so, for one can define $B$ arbitrarily for $\pi$ such
that $\pi_1 \le \pi_2 \le \ldots \le \pi_k$, and define $B$ for other
$\pi$ by symmetry. But the general model of the second kind is
strikingly simple, and I recommend it because I can find no
competitor with comparable aesthetic appeal.

The  status  of  generalized  Bayesian  inference  resembles  that 
of  Bayesian  inference  in  the  time  of  Bayes,  by  which  I  mean  that 
Bayes  must  have  adopted  a  uniform  prior  distribution  because  no 
aesthetically  acceptable  competitor  came  to  mind.  The  analogy 
should  be  carried  further,  for  even  the  principles  by  which  com¬ 
petitors  should  be  judged  were  not  formulated  by  Bayes,  nor  have 
the  required  principles  been  well  formulated  for  the  models  discussed 
here.  I  believe  that  the  principles  required  by  the  two  situations 
are  not  at  all  analogous,  for  the  nature  and  meaning  of  a  prior 
distribution has become quite clear over the last two centuries
and the concept may be carried more or less whole over to generalized
Bayesian inference. The choice of a model satisfying (5.1), on the
other hand, has no obvious connection with prior information as the
term is commonly applied relative to information about postulated
unknowns. In the case of generalized Bayesian inference, I believe
the principles for choosing a model to be closely involved with an
uncertainty principle which can be stated loosely as:

The  more  information  which  one  extracts  from  each 
sample  individual  in  the  form  of  observable 
characteristics,  the  less  information  about  any 
given  aspect  of  the  population  distribution  may 
be  obtained  from  a  random  sample  of  fixed  size. 

For example, a random sample of size $n = 1000$ from a binomial
population yields quite precise and nearly objective inferences
about the single binomial parameter $p$ involved. On the other
hand, if a questionnaire given to a sample of $n = 1000$ has been
sufficient to identify each individual with one of 1,000,000
categories, then it may be foolhardy to put much stock in the sample
information about a binomial $p$ chosen arbitrarily from among the
$2^{1,000,000} - 2$ nontrivial available possibilities. At least
conceptually, most real binomial situations are of the latter kind,
for a single binomial categorization can be achieved only at the
expense of suppressing a large amount of observable information
about each sample individual. The uncertainty principle is therefore
a specific instance of the general scientific truism that an
investigator must carefully delimit and specify his area of
investigation if he is to learn anything precise.

Generalized Bayesian inference makes possible precise formulations
of the uncertainty principle. For example, the model of the second
kind with $k = 2$ and $n = 1000$ yields inferences which most
statisticians would find very nearly acceptable for binomial
sampling. On the other hand, it is a plausible conjecture that
the model of the second kind with $k = 1,000,000$ and $n = 1000$
would yield widely separated upper and lower probabilities for
most events. The high degree of uncertainty in each inference
compensates for the presence of a large number of nuisance
parameters, and protects the user against selection effects which
would produce many spurious inferences. Use of the model of the first
kind with $k = 1,000,000$ and $n = 1000$ would very likely lead to
closer bounds than the model of the second kind for binomial
inferences relating to population splits in accord with the given
order of population members. And it is heuristically clear that
models could be constructed which for each $\pi$ would place each point
of $U$ in each of $U_1, U_2, \ldots, U_k$ as $\pi'$ varies over an
arbitrarily small neighborhood about $\pi$. Such a model would present
an extreme of uncertainty, for all upper and lower probability
inferences would turn out to be one and zero, respectively. It is
suggested here that the choice of a model can only be made with some
understanding of the specific reflections of the uncertainty
principle which it provides. For the time being, I judge that the
important task is to learn more about the inferences yielded by the
aesthetically pleasing models of the second kind. Eventually,
intuition and experience may suggest a broader range of plausible
models.
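
The binomial end of this comparison can be made concrete. The sketch below assumes the $k = 2$ analogue of (2.23) with no prior information: conditional on a nonempty region, the lower and upper end points of the random interval for $p$ are distributed as the $T$-th and $(T+1)$-th order statistics of $n$ uniform variables, where $T$ is the number of successes. Under that assumption the bounds on the event $\{p \le t\}$ are binomial tail sums, and their gap is a single binomial term. Function names are hypothetical.

```python
import math


def binom_pmf(n, k, t):
    """Binomial probability of k successes in n trials, via the log scale."""
    if t <= 0.0:
        return 1.0 if k == 0 else 0.0
    if t >= 1.0:
        return 1.0 if k == n else 0.0
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(t) + (n - k) * math.log1p(-t))
    return math.exp(log_p)


def bounds_p_at_most(t, n, successes):
    """Return (P_*, P^*) for the event {p <= t} given the sample counts."""
    # P^*: the lower end point is <= t  <=>  at least T of n uniforms are <= t
    upper = sum(binom_pmf(n, k, t) for k in range(successes, n + 1))
    # P_*: the upper end point is <= t  <=>  at least T + 1 uniforms are <= t
    lower = upper - binom_pmf(n, successes, t)
    return lower, upper


n, successes = 1000, 500
for t in (0.45, 0.48, 0.50, 0.52, 0.55):
    lo, up = bounds_p_at_most(t, n, successes)
    print(t, round(lo, 4), round(up, 4), round(up - lo, 4))
```

With $n = 1000$ the gap never exceeds the largest single binomial term, about 0.025 here, which is one way to read "very nearly acceptable".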

Models of the second kind were introduced above for sampling
from a general multinomial population with $k$ categories and
unknown $1 \times k$ parameter vector $\pi$. But the range of
application of these models is much wider. First, one may restrict to
parametric hypotheses of the general form
$\pi = \pi(\theta, \phi, \ldots)$. More important, the multinomial may
be allowed to have an infinite number of categories, as explained in
Dempster (1966), so that general spaces of discrete and continuous
observable characteristics are permissible. It is possible therefore
to handle the standard parametric hypotheses of mathematical
statistics. Very few of these have as yet proved analytically
tractable.

At  present,  mainly  qualitative  insights  are  available  into 
the  overview  of  statistical  inference  which  the  sampling  models  of 
generalized  Bayesian  inference  make  possible.  Some  of  these  insights 
have  been  mentioned  above,  such  as  the  symmetric  handling  of  prior 
and  sample  information,  and  the  uncertainty  principle  by  which 
upper  and  lower  probabilities  reflect  the  degree  of  confusion 
produced by small samples from complex situations. It is interesting
to note also that parametric hypotheses and prior distributions,
which are viewed as quite different in conventional statistical
theory, play indistinguishable roles in the logical machinery of
generalized Bayesian inference. For a parametric hypothesis such
as $\pi = \pi(\theta, \phi, \ldots)$ may be represented by a model of
the general type $(X, S, \mu, \Gamma)$ which assigns all of its
probability ambiguously over the subset of $\pi$ allowed by
$\pi(\theta, \phi, \ldots)$ as $\theta, \phi, \ldots$ range over their
permitted values, and this model combines naturally with sample
information using the rule of combination defined in Section 2 and
suggested there to be appropriate for the introduction of prior
information.

Concepts which appear in standard theories of inference may
reappear with altered roles in generalized Bayesian inference.
Likelihood is a prime example. The ordinary likelihood function
$L(\pi)$ based on a sample from a general multinomial population is
proportional to the upper probability of the hypothesis $\pi$. This
may be verified in the trinomial example of Section 2 by checking
that the random region illustrated in Figure 4 covers the point $\pi$
with probability $\pi_1^{n_1} \pi_2^{n_2} \pi_3^{n_3}$. The general
result is hardly more difficult to prove. Now the upper probability
of $\pi$ for all $\pi$ does not contain all the sample information
under generalized Bayesian inference. Thus the likelihood principle
fails in general, and the usual sets of sufficient statistics under
exponential families of parametric hypotheses no longer contain all
of the sample information. The exception occurs in the special case
of ordinary Bayesian inference with an ordinary prior distribution,
as illustrated in (2.31). Thus the failure of the likelihood
principle is associated with the uncertainty which enters when
upper and lower probabilities differ. In passing, note that
marginal likelihoods are defined in the general system, i.e., the
upper probabilities of specific values of $\theta$ from a set of
parameters $\theta, \phi, \ldots$ are well defined and yield a function
$L(\theta)$ which may be called the marginal likelihood of $\theta$
alone. If the prior information consists of an ordinary prior
distribution of $\theta$ alone, with no prior information about the
nuisance parameters, then $L(\theta)$ contains all of the sample
information about $\theta$.
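
The proportionality claim for the trinomial case can be written out in one line. The sketch assumes no prior information, takes the upper probability of the single hypothesis $\pi$ to be the probability that the random region $R$ covers $\pi$ divided by the probability that $R$ is not empty, and uses the value of the latter quoted in Appendix I; the notation $\{\pi\}$ is introduced here only for this check.

```latex
\[
P^*(\{\pi\})
  = \frac{\Pr(R \ni \pi)}{\Pr(R \neq \emptyset)}
  = \frac{\pi_1^{n_1}\,\pi_2^{n_2}\,\pi_3^{n_3}}{n_1!\,n_2!\,n_3!/n!}
  = \frac{n!}{n_1!\,n_2!\,n_3!}\,\pi_1^{n_1}\,\pi_2^{n_2}\,\pi_3^{n_3}
  \;\propto\; L(\pi).
\]
```

Under these assumptions the constant of proportionality is the multinomial coefficient, so the upper probability of $\pi$ is the ordinary multinomial probability of the observed counts.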

Unlike frequency methods, which relate to sequences of trials
rather than to specific questions, the generalized Bayesian
inference framework permits direct answers to specific questions in
the form of probability inferences. I find that significance tests
are inherently awkward and unsatisfying for questions like that
posed in the example of Section 4, and the main reason that Bayesian
inference has not replaced most frequency procedures has been the
stringent requirement of a precise prior distribution. I hope that
I have helped to reduce the stringency of that requirement.

APPENDIX  I

A derivation is sketched here for the distributions (2.23)
relating to specific vertices of the random region $R$ defined by
(2.20). $R$ is the intersection of $n$ regions, one for each
$i = 1, 2, \ldots, n$, as illustrated in Figure 4. The region
corresponding to $u^{(i)}$ which gives rise to an observation $c_1$
consists of points $u$ such that $u_3/u_1 \le u_3^{(i)}/u_1^{(i)}$ and
$u_2/u_1 \le u_2^{(i)}/u_1^{(i)}$. The intersection of the $n_1$ regions
corresponding to the $n_1$ observations $c_1$ is a region $R_1$
consisting of points $u$ such that

(1.1)  $u_3/u_1 \le c_{13}$  and  $u_2/u_1 \le c_{12}$

where $c_{13} = \min[u_3^{(i)}/u_1^{(i)}]$ and
$c_{12} = \min[u_2^{(i)}/u_1^{(i)}]$, the minimization being over the
subset of $i$ corresponding to observations $c_1$. Note that $R_1$
together with the $n_1$ regions which define it are all of the type
pictured on level 1 of Figure 2. By permuting subscripts, define the
analogous regions $R_2$ with coordinates $c_{23}$, $c_{21}$ and $R_3$
with coordinates $c_{31}$, $c_{32}$, where $R_2$ and $R_3$ are of the
types pictured on levels 2 and 3 of Figure 2, respectively. One is led
thus to the representation

(1.2)  $R = R_1 \cap R_2 \cap R_3$.

Any particular instance of the region $R$ which contains at least
one point is a closed polygon whose sides are characterized by fixed
ratios of pairs of coordinates $u_i$, $u_j$. Thus $R$ may be described
by a set of six coordinates

(1.3)  $b_{ij} = \max_{u \in R} [u_j/u_i]$

for $i \ne j$. From (1.1), (1.2), and (1.3) it follows that

(1.4)  $b_{ij} \le c_{ij}$

for $i \ne j$. Moreover, equality holds if the corresponding side of
$R_i$ is also a side of $R$, while the inequality is strict if the side
of $R_i$ misses $R$ entirely. The reader may wish to satisfy himself
that $R$ may have 3, 4, 5, or 6 sides, in which case equality in (1.4)
holds for 3, 4, 5, or 6 pairs $i, j$, respectively (with probability
one).

If $R$ is considered a random region, while $R^0$ is a fixed region
of the same type with coordinates $b^0_{ij}$, then

(1.5)  $\Pr(R \supset R^0) = \Pr(b_{ij} \ge b^0_{ij} \text{ for all } i \ne j)$
       $= (1 + b^0_{12} + b^0_{13})^{-n_1} (1 + b^0_{21} + b^0_{23})^{-n_2} (1 + b^0_{31} + b^0_{32})^{-n_3}$.

To prove (1.5) note first that the three events
$\{b_{12} \ge b^0_{12}, b_{13} \ge b^0_{13}\}$,
$\{b_{21} \ge b^0_{21}, b_{23} \ge b^0_{23}\}$,
$\{b_{31} \ge b^0_{31}, b_{32} \ge b^0_{32}\}$ are equivalent
respectively to the three events
$\{c_{12} \ge b^0_{12}, c_{13} \ge b^0_{13}\}$,
$\{c_{21} \ge b^0_{21}, c_{23} \ge b^0_{23}\}$,
$\{c_{31} \ge b^0_{31}, c_{32} \ge b^0_{32}\}$. In the latter form,
the three events are clearly independent, for they depend on
disjoint sets of independent $u^{(i)}$, and their three probabilities
are the three factors in (1.5). For example, the first event
says that the $n_1$ points $u^{(i)}$ corresponding to observations $c_1$
fall in the subtriangle $u_2/u_1 \ge b^0_{12}$ and
$u_3/u_1 \ge b^0_{13}$ whose area is the fraction
$(1 + b^0_{12} + b^0_{13})^{-1}$ of the area of the whole triangle $U$.

It will be convenient to denote the right side of (1.5) by
$F(b^0_{12}, b^0_{13}, b^0_{21}, b^0_{23}, b^0_{31}, b^0_{32})$, which
defines as the $b^0_{ij}$ vary a form of the joint cumulative
distribution function of the $b_{ij}$. This c.d.f. should be handled
with care. First, it is defined only over the subset of the positive
orthant in six dimensions such that the $b^0_{ij}$ define a nonempty
$R^0$. (Many points in the orthant are ruled out by relations like
$b_{13} \le b_{12} b_{23}$ which are implicit in (1.3).) Second, the
distribution of the $b_{ij}$ is not absolutely continuous over its
six-dimensional domain, but assigns finite probability to various
boundary curved surfaces of dimensions 5, 4 and 3, corresponding to
random $R$ with 5, 4 and 3 sides. Nevertheless it is not difficult to
deduce (2.23) from (1.5).

Suppose that $u^*$ denotes the vertex of $R$ with maximum first
coordinate. This vertex lies, with probability one, at the
intersection of two of the six sides of $R_1$, $R_2$ and $R_3$. By
looking at the vertices defined by all possible pairs of sides it is
easily checked that exactly three possibilities exist for $u^*$,
namely,

(1.6)  (i)    $u^*_1/u^*_2 = c_{21}$  and  $u^*_1/u^*_3 = c_{31}$,
       (ii)   $u^*_3/u^*_2 = c_{23}$  and  $u^*_1/u^*_3 = c_{31}$,
       (iii)  $u^*_1/u^*_2 = c_{21}$  and  $u^*_2/u^*_3 = c_{32}$.

The probability density function of $u^*$ may be formed by summing
the contributions from the three possibilities (i), (ii), (iii).
The contribution from case (i) will be expressed first in terms of
$c_{21}$, $c_{31}$ and then transformed to $u^*_1$, $u^*_2$. Consider
the event $E$ that both
$\{b^0_{21} \le c_{21} \le b^0_{21} + \epsilon,\ b^0_{31} \le c_{31} \le b^0_{31} + \delta\}$
and that the lines $c_{21}$ and $c_{31}$ intersect in a point which
maximizes the first coordinate. The latter condition may be written

(1.7)  $\{c_{12} \ge v_2/v_1,\ c_{13} \ge v_3/v_1,\ c_{23} \ge v_3/v_2,\ c_{32} \ge v_2/v_3\}$

where $v = (v_1, v_2, v_3)$ is the point at which the lines $c_{21}$
and $c_{31}$ intersect, or

(1.8)  $\{c_{12} \ge c_{21}^{-1},\ c_{13} \ge c_{31}^{-1},\ c_{23} \ge c_{21} c_{31}^{-1},\ c_{32} \ge c_{21}^{-1} c_{31}\}$.


Thus, apart from terms of second order and higher in $\delta$ and
$\epsilon$,

(1.9)  $\Pr(E) =$
       $F([b^0_{21}+\epsilon]^{-1},\ [b^0_{31}+\delta]^{-1},\ b^0_{21}+\epsilon,\ [b^0_{21}+\epsilon][b^0_{31}+\delta]^{-1},\ b^0_{31}+\delta,\ [b^0_{21}+\epsilon]^{-1}[b^0_{31}+\delta])$
       $-\ F([b^0_{21}+\epsilon]^{-1},\ (b^0_{31})^{-1},\ b^0_{21}+\epsilon,\ [b^0_{21}+\epsilon](b^0_{31})^{-1},\ b^0_{31},\ [b^0_{21}+\epsilon]^{-1} b^0_{31})$
       $-\ F((b^0_{21})^{-1},\ [b^0_{31}+\delta]^{-1},\ b^0_{21},\ b^0_{21}[b^0_{31}+\delta]^{-1},\ b^0_{31}+\delta,\ (b^0_{21})^{-1}[b^0_{31}+\delta])$
       $+\ F((b^0_{21})^{-1},\ (b^0_{31})^{-1},\ b^0_{21},\ b^0_{21}(b^0_{31})^{-1},\ b^0_{31},\ (b^0_{21})^{-1} b^0_{31})$.

That is, the required case (i) contribution is found in terms of
$c_{21}$, $c_{31}$ represented by $b^0_{21}$, $b^0_{31}$ by
differentiating $F$ with respect to its third and fifth arguments and
then substituting $(b^0_{21})^{-1}$, $(b^0_{31})^{-1}$,
$b^0_{21}(b^0_{31})^{-1}$, $(b^0_{21})^{-1} b^0_{31}$ in order for the
other four arguments. Expressing the result in terms of the
coordinates $u = (u_1, u_2, u_3)$ at which the lines $b^0_{21}$ and
$b^0_{31}$ intersect, one finds
$n_2 n_3 u_1^{n_1} u_2^{n_2+1} u_3^{n_3+1}$ which, after multiplying by
the Jacobian
$|\partial(b^0_{21}, b^0_{31})/\partial(u_1, u_2)| = u_1 u_2^{-2} u_3^{-2}$,
gives the density contribution

(1.10)  $n_2 n_3 u_1^{n_1+1} u_2^{n_2-1} u_3^{n_3-1}$

expressed in terms of $u_1$, $u_2$ and of course $u_3 = 1 - u_1 - u_2$.
The contributions from cases (ii) and (iii) may be found similarly
to be

(1.11)  $n_2 n_3 u_1^{n_1} u_2^{n_2-1} u_3^{n_3}$  and  $n_2 n_3 u_1^{n_1} u_2^{n_2} u_3^{n_3-1}$.

Since $u_1 + u_2 + u_3 = 1$, the sum of the three parts is
$n_2 n_3 u_1^{n_1} u_2^{n_2-1} u_3^{n_3-1}$, or

(1.12)  $\frac{n_1!\, n_2!\, n_3!}{n!} \left[ \frac{n!}{n_1!\,(n_2-1)!\,(n_3-1)!}\, u_1^{n_1} u_2^{n_2-1} u_3^{n_3-1} \right]$,

where the first term is the probability that $u^*$ is anywhere, i.e.,
that $R$ is not empty, while the second is the Dirichlet density
given in (2.23).

The density of the point with minimum first coordinate may
be found by a similar argument. The analogue of (1.6) is

(1.13)  (i)    $u^*_2/u^*_1 = c_{12}$  and  $u^*_3/u^*_1 = c_{13}$,
        (ii)   $u^*_2/u^*_3 = c_{32}$  and  $u^*_3/u^*_1 = c_{13}$,
        (iii)  $u^*_2/u^*_1 = c_{12}$  and  $u^*_3/u^*_2 = c_{23}$,

and the corresponding three components of density turn out to be

(1.14)  $n_1(n_1+1)\, u_1^{n_1-1} u_2^{n_2} u_3^{n_3}$,  $n_1 n_3\, u_1^{n_1-1} u_2^{n_2} u_3^{n_3}$,  and  $n_1 n_2\, u_1^{n_1-1} u_2^{n_2} u_3^{n_3}$,

which sum to

(1.15)  $\frac{n_1!\, n_2!\, n_3!}{n!} \left[ \frac{(n+1)!}{(n_1-1)!\, n_2!\, n_3!}\, u_1^{n_1-1} u_2^{n_2} u_3^{n_3} \right]$,

which like (1.12) is the product of the probability that $R$ is not
empty and the Dirichlet density specified in (2.23).

The remaining four lines of (2.23) follow by symmetry. The
probability that $R$ is not empty may be obtained directly by an
argument whose gist is that, for any set of $n$ points in $U$, there
is exactly one way to assign them to three cells of sizes $n_1$,
$n_2$, $n_3$ corresponding to observations $c_1$, $c_2$, $c_3$ in such a
way that $R$ is not empty. This latter assertion will not be proved
here.
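
Although the assertion is not proved, the resulting probability $n_1!\,n_2!\,n_3!/n!$ can be checked by simulation. The sketch below assumes the description of $R$ in (1.1) and (1.2): $R$ is the set of $u$ with $u_j/u_i \le c_{ij}$ for all $i \ne j$. In logarithmic coordinates these are difference constraints, so $R$ is nonempty exactly when the product of the $c_{ij}$ around every cycle of indices is at least one. The function names, sample sizes and seed are illustrative.

```python
import math
import random


def sample_simplex(rng):
    """Draw a point uniformly from the triangle U via normalized exponentials."""
    e = [rng.expovariate(1.0) for _ in range(3)]
    total = sum(e)
    return [v / total for v in e]


def region_nonempty(points_by_class):
    """points_by_class[i] lists the sample points observed in category i."""
    c = [[math.inf] * 3 for _ in range(3)]
    for i, pts in enumerate(points_by_class):
        for u in pts:
            for j in range(3):
                if j != i:
                    c[i][j] = min(c[i][j], u[j] / u[i])   # c_ij of (1.1)
    # log u_j - log u_i <= log c_ij is feasible iff no cycle product is < 1
    two_cycles = all(c[i][j] * c[j][i] >= 1.0
                     for i in range(3) for j in range(i + 1, 3))
    three_cycles = (c[0][1] * c[1][2] * c[2][0] >= 1.0
                    and c[0][2] * c[2][1] * c[1][0] >= 1.0)
    return two_cycles and three_cycles


def estimate_nonempty(sizes, trials, seed=0):
    """Monte Carlo estimate of Pr(R is not empty) for class sizes n1, n2, n3."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pts = [[sample_simplex(rng) for _ in range(m)] for m in sizes]
        hits += region_nonempty(pts)
    return hits / trials


def exact_nonempty(sizes):
    """The value n1! n2! n3! / n! appearing as the first factor of (1.12)."""
    num = math.prod(math.factorial(m) for m in sizes)
    return num / math.factorial(sum(sizes))


for sizes in [(1, 1, 1), (2, 1, 1)]:
    print(sizes, estimate_nonempty(sizes, 60_000), exact_nonempty(sizes))
```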

APPENDIX  II  (By  H.  Weisberg)

In this appendix we derive formula (3.14). We wish to find
the number of permutations $(1, 2, \ldots, n) \to (1^*, 2^*, \ldots, n^*)$,
carrying an original sequence $(z_1, \ldots, z_n)$ of zeros and ones
into a sequence $(z_{1^*}, \ldots, z_{n^*})$ of zeros and ones subject to
the restriction that $i^* < j^*$ for all $i < j$ such that $z_i = 1$
and $z_j = 0$. Let us call such permutations admissible. The approach
is to count admissible permutations of $(z_1, \ldots, z_n)$ for which
there are $k$ zeros preceding the right-most one. Call such
permutations k-permutations.

First note that each admissible permutation of
$(z_1, \ldots, z_{n+1})$ may be considered as an admissible permutation
of $(z_1, \ldots, z_n)$ with $z_{n+1}$ added on in a position which
maintains the admissibility.

Suppose first that $z_{n+1} = 1$. Corresponding to each admissible
k-permutation of $(z_1, \ldots, z_n)$ we can form

(oj+14k)  admissible  k-permutations  of  (z^, . . . >zn+^) 

1  admissible  i-permutation  of  (z^,...,z_^) 
for  i  -  k  +  1, . . . jOj  . 

Summing over all admissible permutations of $(z_1, \ldots, z_n)$ yields
the first part of (3.14).

Now suppose $z_{n+1} = 0$. Corresponding to each admissible
k-permutation of $(z_1, \ldots, z_n)$, there are  + 1 - k admissible
permutations of $(z_1, \ldots, z_{n+1})$, since $z_{n+1}$ cannot cross to the